Spark Scala: How to transform a column in a DF

I have a dataframe in Spark with many columns and a udf that I defined. I want the same dataframe back, except with one column transformed. Furthermore, my udf takes in a string and returns a timestamp. Is there an easy way to do this? I tried
val test = myDF.select("my_column").rdd.map(r => getTimestamp(r))
but this returns an RDD and just with the transformed column.

If you really need to use your function, I can suggest two options:
Using map / toDF:
import org.apache.spark.sql.Row
import sqlContext.implicits._
def getTimestamp: (String => java.sql.Timestamp) = // your function here
val test = myDF.select("my_column").rdd.map {
  case Row(string_val: String) => (string_val, getTimestamp(string_val))
}.toDF("my_column", "new_column")
Using UDFs (UserDefinedFunction):
import org.apache.spark.sql.functions._
def getTimestamp: (String => java.sql.Timestamp) = // your function here
val newCol = udf(getTimestamp).apply(col("my_column")) // creates the new column
val test = myDF.withColumn("new_column", newCol) // adds the new column to original DF
Alternatively,
If you just want to transform a StringType column into a TimestampType column you can use the unix_timestamp column function available since Spark SQL 1.5:
val test = myDF
  .withColumn("new_column", unix_timestamp(col("my_column"), "yyyy-MM-dd HH:mm")
    .cast("timestamp"))
Note: For Spark 1.5.x, the result of unix_timestamp must be multiplied by 1000 before casting to timestamp (issue SPARK-11724). The resulting code would be:
val test = myDF
  .withColumn("new_column", (unix_timestamp(col("my_column"), "yyyy-MM-dd HH:mm") * 1000L)
    .cast("timestamp"))
Edit: Added udf option
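The question asks for the same dataframe back with one column transformed, while the options above add a separate new_column. A minimal sketch of the in-place variant follows: passing the existing column name to withColumn replaces that column. The object name, the sample data, and the body of getTimestamp (here simply java.sql.Timestamp.valueOf) are assumptions made for illustration, not the asker's actual function.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object ReplaceColumnSketch {
  // Hypothetical stand-in for the asker's function; expects "yyyy-MM-dd HH:mm:ss".
  val getTimestamp: String => Timestamp = s => Timestamp.valueOf(s)

  // Reusing the existing column name makes withColumn replace that column
  // instead of appending a new one; all other columns are kept as they are.
  def transformInPlace(df: DataFrame, name: String): DataFrame =
    df.withColumn(name, udf(getTimestamp).apply(col(name)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("replace-column").master("local[1]").getOrCreate()
    import spark.implicits._

    // Assumed sample data with one extra column to show that it survives.
    val myDF = Seq((1, "2016-01-01 10:30:00")).toDF("id", "my_column")
    val result = transformInPlace(myDF, "my_column")
    result.printSchema() // my_column is now a timestamp column
    spark.stop()
  }
}
```

The same pattern should work with the unix_timestamp expression in place of the udf, which avoids the udf entirely when the input is a plain date string.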

Related

Combining two columns, casting to timestamp and selecting from df causes no error, but casting one column to timestamp and selecting causes error

Description
When I select a column that has been passed through unix_timestamp and then cast to timestamp, I get an AnalysisException. See the link below.
However, when I combine two columns, pass the combination through unix_timestamp, cast it to timestamp and then select it from a df, there is no error.
Disparate Cases
Error:
How to extract year from a date string?
No Error
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val spark: SparkSession = SparkSession.builder()
  .appName("myapp").master("local").getOrCreate()
case class Person(id: Int, date: String, time:String)
import spark.implicits._
val mydf: DataFrame = Seq(Person(1,"9/16/13", "11:11:11")).toDF()
//solution.show()
// column modification
val datecol: Column = mydf("date")
val timecol: Column = mydf("time")
val newcol: Column = unix_timestamp(concat(datecol,lit(" "),timecol),"MM/dd/yy").cast(TimestampType)
mydf.select(newcol).show()
Results
Expected:
Error-sparkanalysis, can't find unix_timestamp(concat(....)) in mydf
Actual:
+------------------------------------------------------------------+
|CAST(unix_timestamp(concat(date, , time), MM/dd/yy) AS TIMESTAMP)|
+------------------------------------------------------------------+
| 2013-09-16 00:00:...|
These do not seem to be disparate cases. In the erroneous case, you created a new dataframe whose column name had changed. See below:
val select_df: DataFrame = mydf.select(unix_timestamp(mydf("date"),"MM/dd/yy").cast(TimestampType))
select_df.select(year($"date")).show()
Here, the only column of select_df is no longer named date; it is named after the expression, something like CAST(unix_timestamp(date, MM/dd/yy) AS TIMESTAMP), so $"date" cannot be resolved.
In the case mentioned above, by contrast, you are just defining a new column expression when you say:
val newcol: Column = unix_timestamp(concat(datecol,lit(" "),timecol),"MM/dd/yy").cast(TimestampType)
You then use this expression to select directly from mydf, where date and time still exist, so it gives the expected result.
Hope this makes things clearer.
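One way to keep the erroneous case working is to give the cast expression an alias, so the derived dataframe still has a column called date. The sketch below illustrates this under assumptions: the object and helper names are hypothetical, and the sample value and the "MM/dd/yyyy" pattern were chosen for the example rather than taken from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, unix_timestamp, year}
import org.apache.spark.sql.types.TimestampType

object AliasAfterCastSketch {
  // Cast a string column to timestamp and keep its original name via an alias,
  // so later references such as col("date") still resolve.
  def castKeepingName(df: DataFrame, name: String, pattern: String): DataFrame =
    df.select(unix_timestamp(col(name), pattern).cast(TimestampType).as(name))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("alias-after-cast").master("local[1]").getOrCreate()
    import spark.implicits._

    val mydf = Seq("09/16/2013").toDF("date") // assumed sample data
    val select_df = castKeepingName(mydf, "date", "MM/dd/yyyy")
    select_df.select(year(col("date"))).show() // resolves, because the alias kept the name
    spark.stop()
  }
}
```

Without the .as(name) call, the selected column would carry the generated expression name and the year lookup would fail in the way described above.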

How to group by on epoch timestamp field in Scala spark

I want to group the records by date, but the date is an epoch timestamp in milliseconds.
Here is the sample data.
date, Col1
1506838074000, a
1506868446000, b
1506868534000, c
1506869064000, a
1506869211000, c
1506871846000, f
1506874462000, g
1506879651000, a
Here is what I'm trying to achieve.
date        Count of records
02-10-2017  4
04-10-2017  3
03-10-2017  5
Here is the code which I tried to group by,
import java.text.SimpleDateFormat
val dateformat:SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd")
val df = sqlContext.read.csv("<path>")
val result = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
But while executing code I am getting below exception.
<console>:30: error: value toLong is not a member of org.apache.spark.sql.ColumnName
val t = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
Please help me to resolve the issue.
You need to convert the date column, which holds epoch milliseconds as a long, into a date string. This can be done with a udf or with the built-in from_unixtime function. After that it is just a groupBy followed by agg with the count function. Using a udf:
import org.apache.spark.sql.functions._
def stringDate = udf((date: Long) => new java.text.SimpleDateFormat("dd-MM-yyyy").format(date))
df.withColumn("date", stringDate($"date"))
  .groupBy("date")
  .agg(count("Col1").as("Count of records"))
  .show(false)
The approach above uses a udf, which should be avoided where possible, since a udf is a black box to the optimizer and requires serialization and deserialization of the column values.
Updated
Thanks to @philantrovert for the suggestion to divide by 1000: from_unixtime expects seconds, while the column holds milliseconds.
import org.apache.spark.sql.functions._
df.withColumn("date", from_unixtime($"date"/1000, "yyyy-MM-dd"))
  .groupBy("date")
  .agg(count("Col1").as("Count of records"))
  .show(false)
Both ways work.
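To make the division by 1000 concrete, here is a small plain-Scala sketch (no Spark) of the same arithmetic applied to the sample values from the question. It is only an illustration: the object and method names are hypothetical, and it formats days in UTC as an assumption, whereas from_unixtime formats in the Spark session time zone.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

object EpochDaySketch {
  // Assumption: days are computed in UTC for this illustration.
  private val dayFormat = DateTimeFormatter.ofPattern("dd-MM-yyyy").withZone(ZoneOffset.UTC)

  // Convert epoch milliseconds to seconds first, then format the day.
  def dayOf(epochMillis: Long): String =
    dayFormat.format(Instant.ofEpochSecond(epochMillis / 1000))

  // Group the timestamps by formatted day and count the rows per day.
  def countPerDay(epochMillis: Seq[Long]): Map[String, Int] =
    epochMillis.groupBy(dayOf).map { case (day, rows) => day -> rows.size }

  def main(args: Array[String]): Unit = {
    val sample = Seq(1506838074000L, 1506868446000L, 1506868534000L, 1506869064000L,
      1506869211000L, 1506871846000L, 1506874462000L, 1506879651000L)
    println(countPerDay(sample)) // all eight sample rows fall on 01-10-2017 in UTC
  }
}
```

Passing the raw millisecond value where seconds are expected would shift the result tens of thousands of years into the future, which is why the division matters.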

Spark Select with a List of Columns Scala

I am trying to find a good way of doing a Spark select with a List[Column]. I am exploding a column and then passing back all the columns I am interested in along with my exploded column.
var columns = getColumns(x) // Returns a List[Column]
tempDf.select(columns) //trying to get
I am trying to find a good way of doing this. I know that if it were a list of strings I could do something like
val result = dataframe.select(columnNames.head, columnNames.tail: _*)
For Spark 2.0 it seems that you have two options. Both depend on how you manage your columns (Strings or Columns).
Spark code (spark-sql_2.11/org/apache/spark/sql/Dataset.scala):
def select(cols: Column*): DataFrame = withPlan {
  Project(cols.map(_.named), logicalPlan)
}
def select(col: String, cols: String*): DataFrame = select((col +: cols).map(Column(_)) : _*)
You can see that internally Spark converts your head and tail into a list of Columns and then calls select again.
So, if you want clear code, I would recommend:
If columns: List[String]:
import org.apache.spark.sql.functions.col
df.select(columns.map(col): _*)
Otherwise, if columns: List[Column]:
df.select(columns: _*)
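Applied to the scenario in the question (keeping some columns and adding an exploded one), the List[Column] form can be sketched as follows. The getColumns signature, the column names and the sample data are assumptions for illustration; the asker's real getColumns is not shown in the question.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object SelectColumnListSketch {
  // Hypothetical stand-in for getColumns: the columns to keep plus one exploded column.
  def getColumns(keep: Seq[String], arrayCol: String, alias: String): List[Column] =
    keep.map(col).toList :+ explode(col(arrayCol)).as(alias)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("select-column-list").master("local[1]").getOrCreate()
    import spark.implicits._

    // Assumed sample data: an id and an array column to explode.
    val tempDf = Seq((1, Seq("a", "b")), (2, Seq("c"))).toDF("id", "tags")
    val columns = getColumns(Seq("id"), "tags", "tag")

    // ": _*" expands the list into the Column* varargs parameter of select.
    tempDf.select(columns: _*).show()
    spark.stop()
  }
}
```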

Spark: How can DataFrame be Dataset[Row] if DataFrame's have a schema

This article claims that a DataFrame in Spark is equivalent to a Dataset[Row], but this blog post shows that a DataFrame has a schema.
Take the example in the blog post of converting an RDD to a DataFrame: if DataFrame were the same thing as Dataset[Row], then converting an RDD to a DataFrame should be as simple as
val rddToDF = rdd.map(value => Row(value))
But instead it shows that it's this
val rddStringToRowRDD = rdd.map(value => Row(value))
val dfschema = StructType(Array(StructField("value",StringType)))
val rddToDF = sparkSession.createDataFrame(rddStringToRowRDD,dfschema)
val rDDToDataSet = rddToDF.as[String]
Clearly a dataframe is actually a dataset of rows and a schema.
In Spark 2.0, in code there is:
type DataFrame = Dataset[Row]
So a DataFrame is a Dataset[Row] simply by definition.
A Dataset also has a schema; you can print it using the printSchema() function. Normally Spark infers the schema, so you don't have to write it yourself, but it is still there ;)
You can also call createTempView(name) and use it in SQL queries, just like with DataFrames.
In other words, Dataset = DataFrame from Spark 1.5 + an encoder that converts rows to your classes. After the types were merged in Spark 2.0, DataFrame became just an alias for Dataset[Row], that is, a Dataset without a specific encoder.
About conversions: rdd.map() also returns an RDD; it never returns a DataFrame. You can do:
// needed for toDF and for the String encoder; rdd is assumed to be an RDD[String]
import sparkSession.implicits._
// Dataset[Row] = DataFrame, without a specific encoder
val rddToDF = rdd.toDF("value")
// now it is told that the encoder for String should be used, so it becomes Dataset[String]
val rDDToDataSet = rddToDF.as[String]
// however, this can be shortened to:
val dataset = sparkSession.createDataset(rdd)
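The alias and the role of the encoder can be seen in a short sketch. It assumes a hypothetical case class Reading and sample data invented for the example; the only point is that a DataFrame value type-checks as Dataset[Row], and that supplying an encoder turns it into a typed Dataset.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, Row, SparkSession}

object DatasetVsDataFrameSketch {
  // Hypothetical class used only for this illustration.
  final case class Reading(value: String)

  // Supplying an encoder for Reading turns the untyped rows into typed objects.
  def typed(df: DataFrame): Dataset[Reading] =
    df.as[Reading](Encoders.product[Reading])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ds-vs-df").master("local[1]").getOrCreate()
    import spark.implicits._

    val df: DataFrame = Seq("a", "b").toDF("value")
    val sameThing: Dataset[Row] = df // compiles because DataFrame is an alias for Dataset[Row]
    val readings: Dataset[Reading] = typed(df)

    sameThing.printSchema() // the schema is present in the untyped form
    readings.printSchema()  // and in the typed form
    spark.stop()
  }
}
```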
Note (in addition to T Gaweda's answer) that there is a schema associated with each Row (Row.schema). However, this schema is not set until the Row becomes part of a DataFrame (or Dataset[Row]):
scala> Row(1).schema
res12: org.apache.spark.sql.types.StructType = null
scala> val rdd = sc.parallelize(List(Row(1)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[5] at parallelize at <console>:28
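scala> val schema = StructType(Seq(StructField("a", IntegerType)))  // assumed definition; not shown in the original transcript
scala> spark.createDataFrame(rdd,schema).first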
res15: org.apache.spark.sql.Row = [1]
scala> spark.createDataFrame(rdd,schema).first.schema
res16: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true))

How to sum the values of one column of a dataframe in spark/scala

I have a Dataframe that I read from a CSV file with many columns like: timestamp, steps, heartrate etc.
I want to sum the values of each column, for instance the total number of steps on "steps" column.
As far as I can see, I want to use this kind of function:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
But I can't understand how to use the function sum.
When I write the following:
val df = CSV.load(args(0))
val sumSteps = df.sum("steps")
the function sum cannot be resolved.
Am I using the function sum wrongly?
Do I need to use the function map first? And if yes, how?
A simple example would be very helpful! I started writing Scala recently.
You must first import the functions:
import org.apache.spark.sql.functions._
Then you can use them like this:
val df = CSV.load(args(0))
val sumSteps = df.agg(sum("steps")).first.get(0)
You can also cast the result if needed:
val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
Edit:
For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:
val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
Edit2:
For dynamically applying the aggregations, the following options are available:
Applying to all numeric columns at once:
df.groupBy().sum()
Applying to a list of numeric column names:
val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)
Applying to a list of numeric column names with aliases and/or casts:
val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
df.groupBy().agg(sums.head, sums.tail:_*).show()
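If you want to sum all values of one column, you can also drop down to the DataFrame's underlying RDD and use reduce. Note that this bypasses the Spark SQL optimizer, so it is not necessarily faster than agg(sum(...)).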
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(10,2,3,4)).toDF("steps")
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_+_)
//res1 Int = 19
Simply apply the aggregation function sum to your column:
df.groupBy().sum("steps").show()
See the documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
Check out this link also: https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
Not sure this was around when this question was asked, but:
df.describe("columnName").show()
gives count, mean and stddev stats (plus min and max) for a column. It covers all numeric and string columns if you call describe() with no arguments.
Using a Spark SQL query, just in case it helps anyone:
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val spark = SparkSession.builder.config(conf).getOrCreate()
// needed for toDF and for the Long encoder used by map
import spark.implicits._
// name the column "steps" so the SQL query below can refer to it
val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF("steps")
df.createOrReplaceTempView("steps")
val sum = spark.sql("select sum(steps) as stepsSum from steps").map(row => row.getAs[Long]("stepsSum")).collect()(0)
println("steps sum = " + sum) //prints 28