Using a case class to rename split columns with Spark Dataframe - scala

I am splitting 'split_column' into five new columns with the code below. However, I want these new columns to be renamed so that they have meaningful names (say "new_renamed1", "new_renamed2", "new_renamed3", "new_renamed4", "new_renamed5" in this example).
val df1 = df.withColumn("new_column", split(col("split_column"), "\\|")).select(col("*") +: (0 until 5).map(i => col("new_column").getItem(i).as(s"newcol$i")): _*).drop("split_column","new_column")
val new_columns_renamed = Seq(".....", "new_renamed1", "new_renamed2", "new_renamed3", "new_renamed4", "new_renamed5")
val df2 = df1.toDF(new_columns_renamed: _*)
However, the issue with this approach is that some of my splits might produce more than fifty new columns. With this renaming approach, a small typo (like an extra comma or a missing double quote) would be painful to detect.
Is there a way to rename the columns with a case class like the one below?
case class SplittedRecord (new_renamed1: String, new_renamed2: String, new_renamed3: String, new_renamed4: String, new_renamed5: String)
Please note that in the actual scenario the names would not look like new_renamed1, new_renamed2, ..., new_renamed5; they would be totally different.

You could try something like this:
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.{col, split}
// Take the new column names from the fields of the case class
val names = Encoders.product[SplittedRecord].schema.fieldNames
names.zipWithIndex
.foldLeft(df.withColumn("new_column", split(col("split_column"), "\\|")))
{ case (acc, (c, i)) => acc.withColumn(c, col("new_column")(i)) }
.drop("split_column", "new_column") // remove the source and helper columns
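As a hedged variation on the answer above, the sketch below builds all the named columns in a single select instead of folding over withColumn, which keeps the query plan flat when there are dozens of pieces. The object name SplitByCaseClass, the helper splitAndName, and the shortened three-field SplittedRecord are illustrative assumptions, not part of the original question.

```scala
import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, split}

// Hypothetical record type: its field names become the new column names.
case class SplittedRecord(new_renamed1: String, new_renamed2: String, new_renamed3: String)

object SplitByCaseClass {
  // Split `sourceCol` on '|' and expose the pieces under the case class field names.
  // The source column itself is left out of the result.
  def splitAndName(df: DataFrame, sourceCol: String): DataFrame = {
    val names = Encoders.product[SplittedRecord].schema.fieldNames
    val parts = split(col(sourceCol), "\\|")
    val others = df.columns.filter(_ != sourceCol).map(col)
    val named = names.zipWithIndex.map { case (n, i) => parts.getItem(i).as(n) }
    df.select(others ++ named: _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only.
    val spark = SparkSession.builder().master("local[1]").appName("split-sketch").getOrCreate()
    import spark.implicits._
    val df = Seq((1, "a|b|c"), (2, "d|e|f")).toDF("id", "split_column")
    splitAndName(df, "split_column").show()
    spark.stop()
  }
}
```

Because the names come from the case class, a misspelled or missing field shows up as a compile-time problem wherever the case class is used, rather than as a silent mismatch in a long Seq of strings.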

One way to use the case class
case class SplittedRecord (new_renamed1: String, new_renamed2: String, new_renamed3: String, new_renamed4: String, new_renamed5: String)
is through a udf function:
import org.apache.spark.sql.functions._
def splitUdf = udf((array: Seq[String])=> SplittedRecord(array(0), array(1), array(2), array(3), array(4)))
df.withColumn("test", splitUdf(split(col("split_column"), "\\|"))).drop("split_column")
.select(col("*"), col("test.*")).drop("test")

Related

How to use non-column value in UserDefinedFunction (UDF) for adding a column to a DataFrame? [duplicate]

I want to parse the date columns in a DataFrame, and for each date column the resolution of the date may change (e.g. 2011/01/10 => 2011/01 if the resolution is set to "Month").
I wrote the following code:
def convertDataFrame(dataframe: DataFrame, schema : Array[FieldDataType], resolution: Array[DateResolutionType]) : DataFrame =
{
import org.apache.spark.sql.functions._
val convertDateFunc = udf{(x:String, resolution: DateResolutionType) => SparkDateTimeConverter.convertDate(x, resolution)}
val convertDateTimeFunc = udf{(x:String, resolution: DateResolutionType) => SparkDateTimeConverter.convertDateTime(x, resolution)}
val allColNames = dataframe.columns
val allCols = allColNames.map(name => dataframe.col(name))
val mappedCols =
{
for(i <- allCols.indices) yield
{
schema(i) match
{
case FieldDataType.Date => convertDateFunc(allCols(i), resolution(i))
case FieldDataType.DateTime => convertDateTimeFunc(allCols(i), resolution(i))
case _ => allCols(i)
}
}
}
dataframe.select(mappedCols:_*)
}
However, it doesn't work. It seems that I can only pass Columns to UDFs. I also wonder whether it would be very slow to convert the DataFrame to an RDD and apply the function to each row.
Does anyone know the correct solution? Thank you!
Just use a little bit of currying:
def convertDateFunc(resolution: DateResolutionType) = udf((x:String) =>
SparkDateTimeConverter.convertDate(x, resolution))
and use it as follows:
case FieldDataType.Date => convertDateFunc(resolution(i))(allCols(i))
On a side note, you should take a look at sql.functions.trunc and sql.functions.date_format. These should do at least part of the job without using UDFs at all.
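To make the trunc and date_format suggestion concrete, here is a minimal sketch that reduces a date column to month resolution with built-in functions only. It assumes the dates arrive as yyyy/MM/dd strings, as in the question's example, and that Spark 2.2 or later is available for the two-argument to_date. The object name, the helper name, and the derived column suffixes are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, date_format, to_date, trunc}

object DateResolutionSketch {
  // Add two month-resolution views of the string column `name` without any UDF.
  def toMonthResolution(df: DataFrame, name: String): DataFrame = {
    // Parse the assumed yyyy/MM/dd text into a DateType column (Spark 2.2+).
    val parsed = to_date(col(name), "yyyy/MM/dd")
    df.withColumn(s"${name}_month", date_format(parsed, "yyyy/MM")) // formatted string
      .withColumn(s"${name}_month_start", trunc(parsed, "month"))   // first day of the month, as a date
  }
}
```

A per-column resolution could then be handled by choosing the format string or trunc unit in ordinary Scala code before the column expression is built, so nothing non-Column has to be passed into a UDF.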
Note:
In Spark 2.2 or later you can use the typedLit function:
import org.apache.spark.sql.functions.typedLit
which supports a wider range of literals, such as Seq or Map.
You can create a literal Column to pass to a udf using the lit(...) function defined in org.apache.spark.sql.functions
For example:
val takeRight = udf((s: String, i: Int) => s.takeRight(i))
df.select(takeRight($"stringCol", lit(1)))
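The two literal helpers can be compared side by side in a short sketch. It assumes Spark 2.2 or later (for typedLit) and a DataFrame with a non-null string column called stringCol. The lookup UDF, the sample Map, and the output column names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, typedLit, udf}

object LiteralArgsSketch {
  // UDF whose second argument is a plain Int, supplied through lit(...).
  val takeRight = udf((s: String, n: Int) => s.takeRight(n))

  // UDF whose second argument is a Map, supplied through typedLit(...).
  val lookup = udf((key: String, table: Map[String, Int]) => table.getOrElse(key, -1))

  // Adds one column per UDF; keys missing from the Map get -1.
  def annotate(df: DataFrame): DataFrame =
    df.withColumn("last_char", takeRight(col("stringCol"), lit(1)))
      .withColumn("code", lookup(col("stringCol"), typedLit(Map("ab" -> 1, "cd" -> 2))))
}
```

For a large lookup table, closing over the value (the currying approach) or a broadcast variable is probably a better fit than embedding it as a literal in every row's expression.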

how to convert RDD[(String, Any)] to Array(Row)?

I've got an unstructured RDD with keys and values. The values are of type RDD[Any] and mainly contain Maps; the keys are currently Strings (RDD[String]). I would like to convert them to type Row so that I can eventually make a DataFrame. Here is my RDD:
removed
Most of the RDD follows a pattern, except for the last 4 keys. How should this be dealt with? Perhaps split them into their own RDD, especially for reverseDeltas?
Thanks
Edit
This is what I've tried so far, based on the first answer below.
case class MyData(`type`: List[String], libVersion: Double, id: BigInt)
object MyDataBuilder{
def apply(s: Any): MyData = {
// read the input data and convert that to the case class
s match {
case Array(x: List[String], y: Double, z: BigInt) => MyData(x, y, z)
case Array(a: BigInt, Array(x: List[String], y: Double, z: BigInt)) => MyData(x, y, z)
case _ => null
}
}
}
val parsedRdd: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
However, it doesn't seem to match any of those cases. How can I match on a Map in Scala? I keep getting nulls back when printing out parsedRdd.
To convert the RDD to a DataFrame you need a fixed schema. Once you define the schema for the RDD, the rest is simple.
Something like:
val rdd2:RDD[Array[String]] = rdd.map( x => getParsedRow(x))
val rddFinal:RDD[Row] = rdd2.map(x => Row.fromSeq(x))
Alternate
case class MyData(....) // all the fields of the Schema I want
object MyDataBuilder {
def apply(s:Any):MyData ={
// read the input data and convert that to the case class
}
}
val rddFinal:RDD[MyData] = rdd.map(x => MyDataBuilder(x))
import spark.implicits._
val myDF = rddFinal.toDF
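Since the question asks how to match on a Map, here is one possible shape for the builder. The original data was removed from the post, so the key names "type", "libVersion" and "id" are assumptions taken from the MyData case class, and the sketch returns an Option rather than null so that unparseable records can simply be dropped.

```scala
// Hypothetical record shape, mirroring the MyData case class from the question.
case class MyData(`type`: List[String], libVersion: Double, id: BigInt)

object MyDataBuilder {
  // Returns None when the value is not a Map or lacks a key of the expected type.
  def apply(value: Any): Option[MyData] = value match {
    case m: Map[_, _] =>
      // Type arguments are erased at runtime, so the key and value types are
      // checked per field below instead of in the pattern itself.
      val fields = m.asInstanceOf[Map[String, Any]]
      for {
        types   <- fields.get("type").collect { case l: List[_] => l.map(_.toString) }
        version <- fields.get("libVersion").collect { case d: Double => d }
        id      <- fields.get("id").collect { case b: BigInt => b }
      } yield MyData(types, version, id)
    case _ => None
  }
}
```

With an RDD this could be applied as rdd.flatMap(x => MyDataBuilder(x)), which keeps only the records that parsed. The irregular keys mentioned in the question would fall into the None branch and could be handled by a second builder of their own.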
There is a method for converting an RDD to a DataFrame.
Use it like this:
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark
val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
Now that you have a DataFrame, you can run whatever queries you want on it, for example:
val textFile = sc.textFile("hdfs://...")
// Creates a DataFrame having a single column named "line"
val df = textFile.toDF("line")
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
// Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()

What is the similar alternative to reduceByKey in DataFrames

Given the following code:
case class Contact(name: String, phone: String)
case class Person(name: String, ts:Long, contacts: Seq[Contact])
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
val people = sqlContext.read.format("orc").load("people")
What is the best way to dedupe users by timestamp,
so that only the user with the max ts stays in the collection?
In Spark with RDDs I would run something like this:
rdd.reduceByKey(_ maxTS _)
and would add the maxTS method to Person or add implicits ...
def maxTS(that: Person):Person =
that.ts > ts match {
case true => that
case false => this
}
Is it possible to do the same with DataFrames, and will the performance be similar?
We are using Spark 1.6.
You can use window functions. I'm assuming that the key is name:
import org.apache.spark.sql.functions.rowNumber // Spark 1.x name; deprecated in 1.6 in favor of row_number
import org.apache.spark.sql.expressions.Window
val df = // convert to DataFrame
val win = Window.partitionBy('name).orderBy('ts.desc)
df.withColumn("personRank", rowNumber.over(win))
.where('personRank === 1).drop("personRank")
For each person this creates a personRank: within each name every row gets a unique number, and the row with the latest ts gets the lowest rank, equal to 1. Then you drop the temporary rank column.
You can also do a groupBy and use an aggregation such as max. Note that this returns only the name and its latest ts, not the whole record:
df.groupBy($"name").agg(max($"ts").alias("maxTS"))
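For a closer analogue of reduceByKey that keeps the whole record, the typed Dataset API offers groupByKey with reduceGroups. The sketch below assumes Spark 2.0 or later (the question targets 1.6, where the typed API looked different), and the object and method names are illustrative.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Contact(name: String, phone: String)
case class Person(name: String, ts: Long, contacts: Seq[Contact])

object LatestPerName {
  // Keep, for each name, the Person with the largest ts.
  def latest(spark: SparkSession, people: Dataset[Person]): Dataset[Person] = {
    import spark.implicits._ // encoders for String keys and Person values
    people
      .groupByKey(_.name)                                 // key by name, like an RDD of (name, person)
      .reduceGroups((a, b) => if (a.ts >= b.ts) a else b) // same role as maxTS in the question
      .map(_._2)                                          // drop the key, keep the winning Person
  }
}
```

The lambdas here are opaque to the optimizer, so the window-function version may still perform better on large inputs; which one wins is worth measuring rather than assuming.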

How to handle dates in Spark using Scala?

I have a flat file that looks like this:
id,name,desg,tdate
1,Alex,Business Manager,2016-01-01
I am using the Spark Context to read this file as follows.
val myFile = sc.textFile("file.txt")
I want to generate a Spark DataFrame from this file and I am using the following code to do so.
case class Record(id: Int, name: String,desg:String,tdate:String)
val myFile1 = myFile.map(x=>x.split(",")).map {
case Array(id, name,desg,tdate) => Record(id.toInt, name,desg,tdate)
}
myFile1.toDF()
This gives me a DataFrame with id as an int and the rest of the columns as Strings.
I want the last column, tdate, to be cast to a date type.
How can I do that?
You just need to convert the String to a java.sql.Date object. Then, your code can simply become:
import java.sql.Date
case class Record(id: Int, name: String,desg:String,tdate:Date)
val myFile1 = myFile.map(x=>x.split(",")).map {
case Array(id, name,desg,tdate) => Record(id.toInt, name,desg,Date.valueOf(tdate))
}
myFile1.toDF()
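An alternative that leaves the case class untouched is to cast the column after the DataFrame has been created. This is a sketch under the assumption that tdate holds yyyy-MM-dd strings, as in the sample row. The object and method names are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object CastToDate {
  // Replace the string column tdate with a DateType column of the same name.
  def castTdate(df: DataFrame): DataFrame =
    df.withColumn("tdate", col("tdate").cast("date"))
}
```

Either way, the header line of the file (id,name,desg,tdate) has to be filtered out before id.toInt is applied, since it is not a valid record.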
