Take column names from old DataFrame in Spark Scala

See my code:
val spark = SparkSession.builder
.master("local[*]")
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.getOrCreate()
val data = spark.read.option("header", "true")
.option("inferSchema", "true")
.csv("src/main/resources/student.csv")
My data looks like:
Id Name City
1 Ali lhr
2 abc khi
3 xyz isb
Now I create a new DataFrame:
val someDF = Seq(
(4,"Ahmad","swl")
).toDF("Id", "Name","City")
Here you can see that I have created a new DataFrame someDF with the same column names as the old DataFrame data, but I have assigned the names manually. My question is: is there any method that can take the column names from the old DataFrame and assign them to a new DataFrame programmatically?
Something like
val featureCols= data.columns

There are two ways to do it: pass the column array as varargs, i.e. (data.columns:_*), or use union. Below is the full example.
val csv =
"""
|Id,Name, City
|1,Ali,lhr
|2,abc,khi
|3,xyz,isb
""".stripMargin.lines.toSeq.toDS()
//*** Option1***
val data: DataFrame = spark.read.option("header", true)
.option("sep", ",")
.option("inferSchema", true)
.csv(csv)
data.show
val someDF: DataFrame = Seq(
(4,"Ahmad","swl")
).toDF(data.columns:_*)
someDF.show
//***Option 2***
val someDF1: DataFrame = Seq(
(4,"Ahmad","swl")
).toDF
data.limit(0).union(someDF1).show
Result :
+---+----+------+
| Id|Name| City|
+---+----+------+
| 1| Ali| lhr|
| 2| abc| khi|
| 3| xyz| isb|
+---+----+------+
+---+-----+------+
| Id| Name| City|
+---+-----+------+
| 4|Ahmad| swl|
+---+-----+------+
+---+-----+------+
| Id| Name| City|
+---+-----+------+
| 4|Ahmad| swl|
+---+-----+------+
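Note on Option 2: DataFrame.union matches columns by position, not by name, which is why data.limit(0).union(someDF1) yields the column names from data even though someDF1's columns default to _1, _2 and _3.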

.toDF accepts (colNames: String*), so we can expand a Seq[String] into varargs with :_*.
Example:
val featureCols=Seq("Id","Name","City")
val someDF = Seq((4,"Ahmad","swl")).toDF(featureCols:_*)
Seq(("1","2","3")).toDF(featureCols:_*).show()
//+---+----+----+
//| Id|Name|City|
//+---+----+----+
//| 1| 2| 3|
//+---+----+----+

Related

Scala: Find the maximum value across each row of a dataframe

For each row of a DataFrame, I would like to extract the maximum value and put it in a new column.
The example code below gives me a DataFrame ('dfmax') of each maximum value:
val donuts = Seq((2.0, 1.50, 3.5), (4.2, 22.3, 10.8), (33.6, 2.50, 7.3))
val df = sparkSession
.createDataFrame(donuts)
.toDF("col1", "col2", "col3")
df.show()
import sparkSession.implicits._
val dfmax = df.map(r => r.getValuesMap[Double](df.schema.fieldNames).map(r => r._2).max)
dfmax.show
This gives me df:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 2.0| 1.5| 3.5|
| 4.2|22.3|10.8|
|33.6| 2.5| 7.3|
+----+----+----+
and dfmax:
+-----+
|value|
+-----+
| 3.5|
| 22.3|
| 33.6|
+-----+
I would like to combine these two frames into one table, preferably using .withColumn or similar, in a style like this (which I cannot get to work):
def maxValue(data: DataFrame): DataFrame = {
val dfmax = df.map(r => r.getValuesMap[Double](df.schema.fieldNames).map(r => r._2).max)
dfmax
}
val udfMaxValue = udf(maxValue _)
df.withColumn("max", udfMaxValue(df))

Convert Date Column to Age with Scala and Spark

I am trying to convert a date column of a Dataset to an actual age.
I am using Scala with Spark and my project is on IntelliJ.
This is the sample dataset
TotalCost|BirthDate|Gender|TotalChildren|ProductCategoryName
1000||Male|2|Technology
2000|1957-03-06||3|Beauty
3000|1959-03-06|Male||Car
4000|1953-03-06|Male|2|
5000|1957-03-06|Female|3|Beauty
6000|1959-03-06|Male|4|Car
7000|1957-03-06|Female|3|Beauty
8000|1959-03-06|Male|4|Car
And this is the Scala code:
import org.apache.spark.sql.SparkSession
object DataFrameFromCSVFile2 {
def main(args:Array[String]):Unit= {
val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExample")
.getOrCreate()
val filePath="src/main/resources/demodata.txt"
val df = spark.read.options(Map("inferSchema"->"true","delimiter"->"|","header"->"true")).csv(filePath).select("Gender", "BirthDate", "TotalCost", "TotalChildren", "ProductCategoryName")
val df2 = df
.filter("Gender is not null")
.filter("BirthDate is not null")
.filter("TotalChildren is not null")
.filter("ProductCategoryName is not null")
df2.show()
}
}
So I am trying to convert a date like 1957-03-06 to an age like 61 in that column.
Any idea will help a lot.
Thank you very much.
You can use the built-in functions months_between() or datediff(). Check this out:
scala> val df = Seq("1957-03-06","1959-03-06").toDF("date")
df: org.apache.spark.sql.DataFrame = [date: string]
scala> df.show(false)
+----------+
|date |
+----------+
|1957-03-06|
|1959-03-06|
+----------+
scala> df.withColumn("age",months_between(current_date,'date)/12).show
+----------+------------------+
| date| age|
+----------+------------------+
|1957-03-06|61.806451612500005|
|1959-03-06|59.806451612500005|
+----------+------------------+
scala> df.withColumn("age",datediff(current_date,'date)/365).show
+----------+-----------------+
| date| age|
+----------+-----------------+
|1957-03-06|61.85205479452055|
|1959-03-06|59.85205479452055|
+----------+-----------------+
scala>
Here's one way that uses the java.time API in a UDF along with Spark's built-in when/otherwise for the null check:
val currentAge = udf{ (dob: java.sql.Date) =>
import java.time.{LocalDate, Period}
Period.between(dob.toLocalDate, LocalDate.now).getYears
}
df.withColumn("CurrentAge", when($"BirthDate".isNotNull, currentAge($"BirthDate"))).
show(5)
// +------+-------------------+---------+-------------+-------------------+----------+
// |Gender| BirthDate|TotalCost|TotalChildren|ProductCategoryName|CurrentAge|
// +------+-------------------+---------+-------------+-------------------+----------+
// | Male| null| 1000| 2| Technology| null|
// | null|1957-03-06 00:00:00| 2000| 3| Beauty| 61|
// | Male|1959-03-06 00:00:00| 3000| null| Car| 59|
// | Male|1953-03-06 00:00:00| 4000| 2| null| 65|
// |Female|1957-03-06 00:00:00| 5000| 3| Beauty| 61|
// +------+-------------------+---------+-------------+-------------------+----------+
You can use the Java Calendar library to get the current date in your timezone and calculate the age with a UDF.
For example:
import java.time.ZoneId
import java.util.Calendar
val data = Seq("1957-03-06","1959-03-06").toDF("date")
val ageUdf = udf((inputDate:String)=>{
val format = new java.text.SimpleDateFormat("yyyy-MM-dd")
val birthDate = format.parse(inputDate).toInstant.atZone(ZoneId.systemDefault()).toLocalDate
val currentDate = Calendar.getInstance().getTime.toInstant.atZone(ZoneId.systemDefault()).toLocalDate
import java.time.Period
if((birthDate != null) && (currentDate != null)) Period.between(birthDate,currentDate).getYears
else 0
})
data.withColumn("age",ageUdf($"date")).show()
The output will be:
+----------+---+
|      date|age|
+----------+---+
|1957-03-06| 61|
|1959-03-06| 59|
+----------+---+

How to transform a string column of a dataframe into a column of Array[String] with Apache Spark and Scala

I have a DataFrame with a column 'title_from' (the sample data was shown as an image, omitted here).
This column contains a sentence and I want to transform this column into an Array[String]. I have tried something like this but it does not work:
val newDF = df.select("title_from").map(x => x.split("\\s+"))
How can I achieve this? How can I transform a DataFrame of strings into a DataFrame of Array[String]? I want every row of newDF to be an array of words from df.
Thanks for any help!
You can use the withColumn function.
import org.apache.spark.sql.functions._
val newDF = df.withColumn("split_title_from", split(col("title_from"), "\\s+"))
.select("split_title_from")
You can try the following to get the list of all authors:
scala> val df = Seq((1,"a1,a2,a3"), (2,"a1,a4,a10")).toDF("id","author")
df: org.apache.spark.sql.DataFrame = [id: int, author: string]
scala> df.show()
+---+---------+
| id| author|
+---+---------+
| 1| a1,a2,a3|
| 2|a1,a4,a10|
+---+---------+
scala> df.select("author").show
+---------+
| author|
+---------+
| a1,a2,a3|
|a1,a4,a10|
+---------+
scala> df.select("author").flatMap( row => { row.get(0).toString().split(",")}).show()
+-----+
|value|
+-----+
| a1|
| a2|
| a3|
| a1|
| a4|
| a10|
+-----+
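Note that this flatMap approach explodes each author into its own row, whereas the split-based answer above keeps one Array[String] per row, which is what the question asked for.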

How to point or select a cell in a dataframe, Spark - Scala

I want to find the time difference between 2 cells.
With arrays in Python I would use a for loop, compute st[i+1] - st[i], and store the results somewhere.
I have this DataFrame sorted by time. How can I do it with Spark 2 or Scala? Pseudo-code is enough.
+--------------------+-------+
| st| name|
+--------------------+-------+
|15:30 |dog |
|15:32 |dog |
|18:33 |dog |
|18:34 |dog |
+--------------------+-------+
If the sliding diffs are to be computed per partition by name, I would use the lag() Window function:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
("a", 100), ("a", 120),
("b", 200), ("b", 240), ("b", 270)
).toDF("name", "value")
val window = Window.partitionBy($"name").orderBy("value")
df.
withColumn("diff", $"value" - lag($"value", 1).over(window)).
na.fill(0).
orderBy("name", "value").
show
// +----+-----+----+
// |name|value|diff|
// +----+-----+----+
// | a| 100| 0|
// | a| 120| 20|
// | b| 200| 0|
// | b| 240| 40|
// | b| 270| 30|
// +----+-----+----+
On the other hand, if the sliding diffs are to be computed across the entire dataset, a Window function without partitioning wouldn't scale, hence I would resort to RDD's sliding() function:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.mllib.rdd.RDDFunctions._
val rdd = df.rdd
val diffRDD = rdd.sliding(2).
map{ case Array(x, y) => Row(y.getString(0), y.getInt(1), y.getInt(1) - x.getInt(1)) }
val headRDD = sc.parallelize(Seq(Row.fromSeq(rdd.first.toSeq :+ 0)))
val headDF = spark.createDataFrame(headRDD, df.schema.add("diff", IntegerType))
val diffDF = spark.createDataFrame(diffRDD, df.schema.add("diff", IntegerType))
val resultDF = headDF union diffDF
resultDF.show
// +----+-----+----+
// |name|value|diff|
// +----+-----+----+
// | a| 100| 0|
// | a| 120| 20|
// | b| 200| 80|
// | b| 240| 40|
// | b| 270| 30|
// +----+-----+----+
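Note that sliding(2) pairs rows across name boundaries, which is why the first "b" row shows a diff of 80 (200 - 120) here instead of 0 as in the partitioned version.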
Something like:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object Data1 {
import org.apache.log4j.Logger
import org.apache.log4j.Level
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
def main(args: Array[String]) : Unit = {
implicit val spark: SparkSession =
SparkSession
.builder()
.appName("Test")
.master("local[1]")
.getOrCreate()
import org.apache.spark.sql.functions.col
val rows = Seq(Row(1, 1), Row(1, 1), Row(1, 1))
val schema = List(StructField("int1", IntegerType, true), StructField("int2", IntegerType, true))
val someDF = spark.createDataFrame(
spark.sparkContext.parallelize(rows),
StructType(schema)
)
someDF.withColumn("diff", col("int1") - col("int2")).show()
}
}
gives
+----+----+----+
|int1|int2|diff|
+----+----+----+
| 1| 1| 0|
| 1| 1| 0|
| 1| 1| 0|
+----+----+----+
If you are specifically looking to diff adjacent elements in a collection then in Scala I would zip the collection with its tail to give a collection containing tuples of adjacent pairs.
Unfortunately there isn't a tail method on RDDs or DataFrames/Sets
You could do something like:
val a = myDF.rdd
val tail = myDF.rdd.zipWithIndex.collect{
case (v, index) if index > 0 => v}
a.zip(tail).map{ case (l, r) => /* diff l and r st column */}.collect
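For reference, here is a minimal sketch of the zip-with-tail idea on a plain Scala collection (the values are hypothetical, not from the thread):
// Pair each element with its successor, then diff each pair.
val times = Seq(100, 120, 200, 240, 270)
val diffs = times.zip(times.tail).map { case (prev, next) => next - prev }
// diffs: Seq(20, 80, 40, 30)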

Use a generated string in the select expr of dataframe

I am very new to Scala programming, so this might be a basic question.
I am planning to create a dataframe dynamically.
This is my end goal :
val df2 = df1.select("col1","col2","col3")
I have a function which generates these column names, and I save the result to a variable like this:
scala> val colVar = generateColSelectionString(4)
colVar: String = col1,col2,col3
Now,
How do I do something like this:
val df2 = df1.select(colVar)
You can split the string and use selectExpr:
val df = Seq((1,2,3)).toDF("col1","col2","col3")
val colVar = "col1,col2,col3"
df.selectExpr(colVar.split(","):_*).show
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 2| 3|
+----+----+----+
Split "colVar" variable, and use "select" with two parameters:
val data = List(("v1", "v2", "v3"))
val df = sparkContext.parallelize(data).toDF("col1", "col2", "col3")
val colVar = "col1,col2,col3"
val columnList = colVar.split(",")
val result = df.select(columnList.head, columnList.tail: _*)
result.show(false)
Output:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|v1 |v2 |v3 |
+----+----+----+
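As a side note not covered in the original answers, a third common option is to map the column names to Column objects and pass them as varargs; a sketch of the same selection:
import org.apache.spark.sql.functions.col
// Same dynamic selection, converting each name to a Column first.
val result2 = df.select(columnList.map(col): _*)
result2.show(false)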