How to handle dates in Spark using Scala?

I have a flat file that looks like this:
id,name,desg,tdate
1,Alex,Business Manager,2016-01-01
I am using the SparkContext to read this file as follows.
val myFile = sc.textFile("file.txt")
I want to generate a Spark DataFrame from this file and I am using the following code to do so.
case class Record(id: Int, name: String, desg: String, tdate: String)
val myFile1 = myFile.map(x => x.split(",")).map {
  case Array(id, name, desg, tdate) => Record(id.toInt, name, desg, tdate)
}
myFile1.toDF()
This is giving me a DataFrame with id as int and rest of the columns as String.
I want the last column, tdate, to be cast to date type.
How can I do that?

You just need to convert the String to a java.sql.Date object. Then, your code can simply become:
import java.sql.Date

case class Record(id: Int, name: String, desg: String, tdate: Date)
val myFile1 = myFile.map(x => x.split(",")).map {
  // Date.valueOf expects the yyyy-MM-dd format used in the file
  case Array(id, name, desg, tdate) => Record(id.toInt, name, desg, Date.valueOf(tdate))
}
myFile1.toDF()
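Alternatively, if you keep the original Record with tdate: String (as in the question), you can cast at the DataFrame level instead; a minimal sketch, assuming Spark's built-in to_date function and the yyyy-MM-dd layout shown above:
import org.apache.spark.sql.functions.{col, to_date}

// Starting from the question's DataFrame, where tdate is still a String column.
// to_date parses yyyy-MM-dd strings, matching the file layout above.
val df = myFile1.toDF()                                   // the String-typed Record version
val withDate = df.withColumn("tdate", to_date(col("tdate")))
withDate.printSchema()                                    // tdate now shows up as date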

Related

Spark SQL convert dataset to dataframe

How do I convert a Dataset object to a DataFrame? In my example, I read a JSON file into a DataFrame and convert it to a Dataset. In the Dataset I add an additional attribute (newColumn) and then convert it back to a DataFrame. Here is my example code:
val empData = sparkSession.read.option("header", "true").option("inferSchema", "true").option("multiline", "true").json(filePath)
.....
import sparkSession.implicits._
val res = empData.as[Emp]
//for (i <- res.take(4)) println(i.name + " ->" + i.newColumn)
val s = res.toDF();
s.printSchema()
}
case class Emp(name: String, gender: String, company: String, address: String) {
val newColumn = if (gender == "male") "Not-allowed" else "Allowed"
}
But I expected the new column newColumn to show up in the s.printSchema() output, and it does not. Why? How can I achieve this?
The schema produced by the Product encoder is determined solely by the case class's constructor signature, so anything that happens in the class body is simply discarded.
You can instead do:
empData.as[Emp].map(x => (x, x.newColumn)).toDF("value", "newColumn")
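If you want a flat schema rather than a nested value struct, here is a small follow-up sketch (Spark 2.x assumed, reusing the Emp case class from the question):
import sparkSession.implicits._

// Map each Emp to (Emp, newColumn), then expand the struct back into top-level columns.
val flat = empData.as[Emp]
  .map(x => (x, x.newColumn))
  .toDF("value", "newColumn")
  .select($"value.*", $"newColumn")

flat.printSchema()  // name, gender, company, address, newColumn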

Using a case class to rename split columns with Spark Dataframe

I am splitting 'split_column' into another five columns as per the following code. However, I want these new columns to be renamed so that they have meaningful names (say "new_renamed1", "new_renamed2", "new_renamed3", "new_renamed4", "new_renamed5" in this example):
val df1 = df.withColumn("new_column", split(col("split_column"), "\\|"))
  .select(col("*") +: (0 until 5).map(i => col("new_column").getItem(i).as(s"newcol$i")): _*)
  .drop("split_column", "new_column")
val new_columns_renamed = Seq(/* ...existing column names... */ "new_renamed1", "new_renamed2", "new_renamed3", "new_renamed4", "new_renamed5")
val df2 = df1.toDF(new_columns_renamed: _*)
However, the issue with this approach is that some of my splits might produce more than fifty new columns. With this renaming approach, a small typo (an extra comma, a missing double quote) would be painful to detect.
Is there a way to rename the columns with a case class like the one below?
case class SplittedRecord (new_renamed1: String, new_renamed2: String, new_renamed3: String, new_renamed4: String, new_renamed5: String)
Please note that in the actual scenario the names would not look like new_renamed1, new_renamed2, ..., new_renamed5; they would be totally different.
You could try something like this:
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.{col, split}

// Take the target column names straight from the case class
val names = Encoders.product[SplittedRecord].schema.fieldNames

names.zipWithIndex
  .foldLeft(df.withColumn("new_column", split(col("split_column"), "\\|"))) {
    case (df, (c, i)) => df.withColumn(c, $"new_column"(i))
  }
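If you want to get rid of the helper columns afterwards, a short extension of the snippet above (the final drop call is my addition, not part of the original answer):
val result = names.zipWithIndex
  .foldLeft(df.withColumn("new_column", split(col("split_column"), "\\|"))) {
    case (acc, (c, i)) => acc.withColumn(c, $"new_column"(i))
  }
  .drop("split_column", "new_column")  // keep only the renamed columns plus the rest of df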
One of the ways to use the case class
case class SplittedRecord (new_renamed1: String, new_renamed2: String, new_renamed3: String, new_renamed4: String, new_renamed5: String)
is through a udf function:
import org.apache.spark.sql.functions._
def splitUdf = udf((array: Seq[String]) => SplittedRecord(array(0), array(1), array(2), array(3), array(4)))
// "test" is a struct column; col("test.*") expands its fields into top-level columns
df.withColumn("test", splitUdf(split(col("split_column"), "\\|")))
  .drop("split_column")
  .select(col("*"), col("test.*"))
  .drop("test")
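One caveat worth noting: the udf will throw an exception on any row whose split produces fewer than five elements. A defensive variant (the padding behaviour here is my own assumption, not part of the original answer):
def splitUdfSafe = udf { array: Seq[String] =>
  val padded = array.padTo(5, "")  // pad short splits with empty strings
  SplittedRecord(padded(0), padded(1), padded(2), padded(3), padded(4))
}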

Spark DataFrame not supporting Char datatype

I am creating a Spark DataFrame from a text file, say an Employee file that contains String, Int, and Char fields.
I created a case class:
case class Emp(
  Name: String,
  eid: Int,
  Age: Int,
  Sex: Char,
  Sal: Int,
  City: String)
I created textFileRDD1 using split, then created textFileRDD2:
val textFileRDD2 = textFileRDD1.map(attributes => Emp(
attributes(0),
attributes(1).toInt,
attributes(2).toInt,
attributes(3).charAt(0),
attributes(4).toInt,
attributes(5)))
And the final DataFrame as:
val finalRDD = textFileRDD2.toDF
When I create it, it throws the error:
java.lang.UnsupportedOperationException: No Encoder found for scala.Char
Can anyone help me understand why, and how to resolve it?
Spark SQL doesn't provide an Encoder for Char, and generic encoders are not very useful here.
You can either use a StringType:
attributes(3).slice(0, 1)
or a ShortType (or BooleanType or ByteType, if you accept only a binary response):
attributes(3)(0) match {
  case 'F' => 1: Short
  ...
  case _ => 0: Short
}
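Putting the StringType option together with the original Emp class, a minimal sketch (changing Sex from Char to String is the only adjustment; everything else follows the question):
case class Emp(
  Name: String,
  eid: Int,
  Age: Int,
  Sex: String,  // was Char; String has a built-in Encoder
  Sal: Int,
  City: String)

val textFileRDD2 = textFileRDD1.map(attributes => Emp(
  attributes(0),
  attributes(1).toInt,
  attributes(2).toInt,
  attributes(3).slice(0, 1),  // keep the first character as a one-letter String
  attributes(4).toInt,
  attributes(5)))

val finalDF = textFileRDD2.toDF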

Spark SQL: How to convert List[List[Any]] to DataFrame

val list = List(List(1,"Ankita"),List(2,"Kunal"))
and now I want to convert it into a DataFrame:
val list = List(List(1,"Ankita"),List(2,"Kunal")).toDF("id","name")
but it throws an error:
java.lang.ClassNotFoundException: Scala.any
AFAIK, a List[List[Any]] cannot be converted to a DataFrame directly; it needs to be converted to some concrete type first (here I use Person as an example), i.e. a List[Person]:
case class Person(id: Int, name: String)
val list = List(List(1,"Ankita"),List(2,"Kunal"))
val listDf = list.map(x => Person(x(0).asInstanceOf[Int], x(1).toString)).toDF("id","name")
Another way, per the comment from user8371915, is to create a list of pairs and convert that to a DataFrame:
val listDf = list.map {
  case List(id: Int, name: String) => (id, name)
}.toDF("id", "name")
Because the inner List can be of arbitrary size, the implicit conversion to a DataFrame cannot be applied.
It can be converted directly if you change it to a List of tuples:
val list = List((1,"Ankita"),(2,"Kunal")).toDF("id","name")
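Note that the pattern-match version will throw a MatchError on any inner list that is not exactly an Int followed by a String; if that is a concern, here is a defensive sketch using collect instead of map (my suggestion, not from the original answers):
val listDf = list.collect {
  case List(id: Int, name: String) => (id, name)  // silently drops malformed inner lists
}.toDF("id", "name")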

How to convert datatypes in Spark SQL and get the result as an RDD of a specific class

I am reading a CSV file and need to create an RDD with a schema.
I read the file using sqlContext.csvFile:
val testfile = sqlContext.csvFile("file")
testfile.registerTempTable("testtable")
I want to pick some of the fields and return an RDD of those fields, typed to a class.
For example: case class Test(ID: String, order_date: Date, Name: String, value: Double)
Using sqlContext.sql("Select col1, col2, col3, col4 FROM ..."):
val testfile = sqlContext.sql("Select col1, col2, col3, col4 FROM testtable").collect
testfile.getClass
Class[_ <: Array[org.apache.spark.sql.Row]] = class [Lorg.apache.spark.sql.Row;
So I want to change col1 to a double, col2 to a date, and col3 to a string.
Is there a way to do this in sqlContext.sql, or do I have to run a map function over the result and then turn it back into an RDD?
I tried to do it in one statement and got this error:
val old_rdd : RDD[Test] = sqlContext.sql("SELECT col, col2, col3,col4 FROM testtable").collect.map(t => (t(0) : String ,dateFormat.parse(dateFormat.format(1)),t(2) : String, t(3) : Double))
The issue I am having is that the assignment does not produce an RDD[Test], where Test is a class I defined.
The error says that the map call produces an Array rather than an RDD:
found : Array[edu.model.Test]
[error] required: org.apache.spark.rdd.RDD[edu.model.Test]
Let's say you have a case class like this:
case class Test(
ID: String, order_date: java.sql.Date, Name: String, value: Double)
Since you load your data with csvFile with default parameters, it doesn't perform any schema inference, so your data is stored as plain strings. Let's assume there are no other fields:
val df = sc.parallelize(
("ORD1", "2016-01-02", "foo", "2.23") ::
("ORD2", "2016-07-03", "bar", "9.99") :: Nil
).toDF("col1", "col2", "col3", "col4")
Your attempt to use map is wrong for more than one reason:
the function you use annotates individual values with incorrect types. Not only is Row.apply of type Int => Any, but your table shouldn't contain any Double values at this point, since everything is still a String;
since you collect (which doesn't make sense here), you fetch all the data to the driver, so the result is a local Array, not an RDD;
finally, even if all the previous issues were resolved, (String, Date, String, Double) is clearly not a Test.
One way to handle this:
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
val casted = df.select(
$"col1".alias("ID"),
$"col2".cast("date").alias("order_date"),
$"col3".alias("name"),
$"col4".cast("double").alias("value")
)
val tests: RDD[Test] = casted.map {
case Row(id: String, date: java.sql.Date, name: String, value: Double) =>
Test(id, date, name, value)
}
You can also try to use the new Dataset API, but it is far from stable:
casted.as[Test].rdd
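One version note, hedged: the answer above assumes the pre-2.0 API, where DataFrame.map returns an RDD. On Spark 2.x, DataFrame is an alias for Dataset[Row] and map returns another Dataset, so you would go through .rdd first:
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD

// Spark 2.x variant: take the underlying RDD before pattern matching on Row
val tests: RDD[Test] = casted.rdd.map {
  case Row(id: String, date: java.sql.Date, name: String, value: Double) =>
    Test(id, date, name, value)
}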