Spark SQL convert dataset to dataframe - scala

How do I convert a Dataset object back to a DataFrame? In my example, I read a JSON file into a DataFrame and convert it to a Dataset. In the Dataset I add an additional attribute (newColumn) and then convert it back to a DataFrame. Here is my example code:
val empData = sparkSession.read.option("header", "true").option("inferSchema", "true").option("multiline", "true").json(filePath)
.....
import sparkSession.implicits._
val res = empData.as[Emp]
//for (i <- res.take(4)) println(i.name + " ->" + i.newColumn)
val s = res.toDF();
s.printSchema()
}
case class Emp(name: String, gender: String, company: String, address: String) {
val newColumn = if (gender == "male") "Not-allowed" else "Allowed"
}
I expected the new column newColumn to appear in the s.printSchema() output, but it does not. Why not, and how can I achieve this?

The schema of the output with a Product encoder is determined solely by its constructor signature, so anything that happens in the class body is simply discarded.
You can carry the derived value explicitly instead:
empData.as[Emp].map(x => (x, x.newColumn)).toDF("value", "newColumn")
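Alternatively, since the derivation is just a condition on gender, a minimal sketch (assuming the same Emp fields as in the question) expresses newColumn as a column expression, so it shows up directly in the DataFrame schema:

import org.apache.spark.sql.functions.{col, when}

// empData is the DataFrame read from the JSON file in the question
val s = empData.withColumn("newColumn",
  when(col("gender") === "male", "Not-allowed").otherwise("Allowed"))
s.printSchema() // newColumn now appears in the schema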

Related

Dynamic conversion of Array of double columns into multiple columns in nested spark dataframe

My current DataFrame looks like as below:
{"id":"1","inputs":{"values":{"0.2":[1,1],"0.4":[1,1],"0.6":[1,1]}},"id1":[1,2]}
I want to transform this dataframe into the below dataFrame:
{"id":"1", "v20":[1,1],"v40":[1,1],"v60":[1,1],"id1":[1,2]}
That is, each key inside 'values' (0.2, 0.4 and 0.6) should be multiplied by 100, prefixed with the letter 'v', and its array extracted into a separate column.
What would the code look like to achieve this? I have tried withColumn but could not get it to work.
Try the code below; the inline comments explain each step.
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

object DynamicCol {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val df = spark.read.json("src/main/resources/dyamicCol.json") // Load the JSON file
    val dfTemp = df.select(col("inputs.values").as("values")) // Temp DataFrame for fetching the nested values
    val index = dfTemp.schema.fieldIndex("values")
    val propSchema = dfTemp.schema(index).dataType.asInstanceOf[StructType]
    val dfFinal = propSchema.fields.foldLeft(df)((df, field) => { // Add one column per nested field
      val colNameInt = (field.name.toDouble * 100).toInt
      val colName = s"v$colNameInt"
      df.withColumn(colName, col("inputs.values.`" + field.name + "`")) // Add the nested column mapping
    }).drop("inputs") // Drop the original nested column
    dfFinal.write.mode(SaveMode.Overwrite).json("src/main/resources/dyamicColOut.json") // Write the output JSON file
  }
}
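As a quick sanity check (a sketch, based on the sample record in the question), the resulting columns should match the expected output:

dfFinal.printSchema() // expected columns: id, id1, v20, v40, v60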
I would split the column-renaming logic into two parts: the case where the name is a numeric value, and the case where it stays unchanged.
def stringDecimalToVNumber(colName:String): String =
"v" + (colName.toFloat * 100).toInt.toString
Then define a single function that transforms a column name according to which case it matches:
val floatRegex = """(\d+\.?\d*)""".r

def transformColumnName(colName: String): String = colName match {
  case floatRegex(v) => stringDecimalToVNumber(v) // it's a float, transform it
  case x => x // keep it as-is
}
Now that we have the function to transform the column names, let's pick up the schema dynamically.
val flattenDF = df.select("id","inputs.values.*")
val finalDF = flattenDF
  .schema.names
  .foldLeft(flattenDF)((dfacum, x) => {
    val newName = transformColumnName(x)
    if (newName == x)
      dfacum // the name didn't need to be changed
    else
      dfacum.withColumnRenamed(x, newName)
  })
This will dynamically rename all the columns coming from inputs.values and leave them alongside id.
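Note that the select above only keeps id and the nested values. If you also need id1 (as in the expected output of the question), include it in the select, for example:

val flattenDF = df.select("id", "id1", "inputs.values.*")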

Getting datatype of all columns in a dataframe using scala

I have a data frame into which the following data is loaded (shown as an image in the original post).
I am trying to develop code that will read data from any source, load it into a data frame, and return the following output (also shown as an image).
You can use the schema property and then iterate over the fields.
Example:
Seq(("A", 1))
.toDF("Field1", "Field2")
.schema
.fields
.foreach(field => println(s"${field.name}, ${field.dataType}"))
Results:
Field1, StringType
Field2, IntegerType
Make sure to take a look at the Spark ScalaDoc.
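If you want the result as a DataFrame rather than printed lines, here is a small sketch along the same idea (the column names Column and DataType are just for illustration):

import spark.implicits._ // assumes a SparkSession named spark is in scope

val df = Seq(("A", 1)).toDF("Field1", "Field2")
val typesDF = df.schema.fields
  .map(f => (f.name, f.dataType.simpleString)) // e.g. ("Field1", "string")
  .toSeq
  .toDF("Column", "DataType")
typesDF.show()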
That's the closest I could get to the output. Create a schema from a case class, then build a DataFrame from the list of schema columns and map each one to its Dimension/Measure type:
import java.sql.Date

object GetColumnDf {

  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess // helper from the poster's project returning a SparkSession

    val map = Map("Emp_ID" -> "Dimension", "Cust_Name" -> "Dimension", "Cust_Age" -> "Measure",
      "Salary" -> "Measure", "DoJ" -> "Dimension")

    import spark.implicits._
    // Build a one-row DataFrame from the case class, read its schema fields,
    // and turn the (name, dataType, Dimension/Measure) tuples into the output DataFrame
    val headersDF = Seq(Bean123("C-1001", "Jack", 25, 3000, new Date(2000000))).toDF().schema.fields
      .map(col => (col.name, col.dataType.toString, map.get(col.name))).toList.toDF("Headers", "Data_Type", "Type")
    headersDF.show()
  }
}
case class Bean123(Emp_ID: String,Cust_Name: String,Cust_Age: Int, Salary : Int,DoJ: Date)

Add column while maintaining correlation of the existing columns in Apache Spark Scala

I have a dataframe with columns review and rating in Spark Scala
val stopWordsList = scala.io.Source.fromFile("stopWords").getLines.toList
val downSampleReviewsDF = sqlContext.sql("SELECT review, rating FROM ds");
I have written a function which will remove stopWords from a given review (String)
def cleanTextFunc(text: String, removeList: List[String]): String = removeList.fold(text) {
  case (acc, termToRemove) =>
    acc.replaceAll("\\b" + termToRemove + "\\b", "").replaceAll("""[\p{Punct}&&[^.]]""", "").replaceAll(" +", " ")
}
How do I add another column new_review alongside review and rating? new_review should use cleanTextFunc() to get the cleaned text for every row; cleanTextFunc takes two arguments: 1. the text to clean and 2. the list of stop words to remove from it.
Output should have Text | Rating | New_Text
Just a few more lines:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Curried method to create a UDF from removeList
def getStopWordRemoverUdf(removeList: List[String]): UserDefinedFunction = {
  udf { (text: String) =>
    cleanTextFunc(text, removeList)
  }
}

// Create the UDF by passing your stop-word list
val stopWordRemoverUdf: UserDefinedFunction = getStopWordRemoverUdf(stopWordsList)

// Use the UDF to create the new column
val cleanedDownSampleReviewsDf: DataFrame = downSampleReviewsDF
  .withColumn("new_review", stopWordRemoverUdf(downSampleReviewsDF("review")))
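A quick usage check on the result (just a sketch; the column names follow the question):

cleanedDownSampleReviewsDf
  .select("review", "rating", "new_review")
  .show(5, truncate = false)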
References
Passing extra parameters to UDF in Spark

Creating column types for schema

I have a text file that I read from and parse to create a dataframe. However, the columns amount and code should be IntegerTypes. Here's what I have:
def getSchema: StructType = {
  StructType(Seq(
    StructField("carrier", StringType, false),
    StructField("amount", StringType, false),
    StructField("currency", StringType, false),
    StructField("country", StringType, false),
    StructField("code", StringType, false)
  ))
}

def getRow(x: String): Row = {
  val columnArray = new Array[String](5)
  columnArray(0) = x.substring(40, 43)
  columnArray(1) = x.substring(43, 46)
  columnArray(2) = x.substring(46, 51)
  columnArray(3) = x.substring(51, 56)
  columnArray(4) = x.substring(56, 64)
  Row.fromSeq(columnArray)
}
Because I use an Array[String], the columns can only be StringType, not a mix of String and Integer. To explain my problem in detail, here's what happens:
First I create an empty dataframe:
var df = spark.sqlContext.createDataFrame(spark.sparkContext.emptyRDD[Row], getSchema)
Then I have a for loop that goes through each file in all the directories. Note: I need to validate every file and cannot read all at once.
for (each file parse):
df2 = spark.sqlContext.createDataFrame(spark.sparkContext.textFile(inputPath)
.map(x => getRow(x)), schema)
df = df.union(df2)
I now have a complete dataframe of all the files. However, columns amount and code are StringTypes still. How can I make it so that they are IntegerTypes?
Please note: I cannot cast the columns during the for-loop because it takes a lot of time, and I'd like to keep the current structure as similar as possible. At the end of the loop I could cast the columns to IntegerType, but what if a column contains a value that is not an integer? I'd like the columns not to end up NULL.
Is there a way to make the 2 specified columns IntegerTypes without adding a lot of change to the code?
What about using datasets?
First create a case class modelling your data:
case class MyObject(
carrier: String,
amount: Double,
currency: String,
country: String,
code: Int)
Create another case class wrapping the first one with additional info (potential errors, source file):
case class MyObjectWrapper(
myObject: Option[MyObject],
someError: Option[String],
source: String
)
Then create a parser that transforms a line from your file into a MyObject:
import scala.util.{Failure, Success, Try}

object Parser {
  def parse(line: String, file: String): MyObjectWrapper = {
    Try {
      MyObject(
        carrier = line.substring(40, 43),
        amount = line.substring(43, 46).toDouble,
        currency = line.substring(46, 51),
        country = line.substring(51, 56),
        code = line.substring(56, 64).toInt)
    } match {
      case Success(objectParsed) => MyObjectWrapper(Some(objectParsed), None, file)
      case Failure(error) => MyObjectWrapper(None, Some(error.getLocalizedMessage), file)
    }
  }
}
Finally, parse your files:
import spark.implicits._

val ds = files
  .filter( {METHOD TO SELECT CORRECT FILES} )
  .map( { GET INPUT PATH FROM FILES} )
  .map(path => spark.read.textFile(path).map(Parser.parse(_, path)))
  .reduce(_.union(_))
This should give you a Dataset[MyObjectWrapper] with the types and APIs you wish.
Afterwards you can take those you could parse:
ds.filter(_.someError == None)
Or take those you failed to parse (for investigation):
ds.filter(_.someError != None)
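If you want to keep working with the successfully parsed rows as typed objects, a small sketch (assuming the ds above and spark.implicits._ in scope) unwraps the Option:

import org.apache.spark.sql.Dataset

val parsed: Dataset[MyObject] = ds
  .filter(_.someError.isEmpty) // keep only rows that parsed successfully
  .flatMap(_.myObject)         // unwrap Option[MyObject]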

How to handle dates in Spark using Scala?

I have a flat file that looks like the one below.
id,name,desg,tdate
1,Alex,Business Manager,2016-01-01
I am using the Spark Context to read this file as follows.
val myFile = sc.textFile("file.txt")
I want to generate a Spark DataFrame from this file and I am using the following code to do so.
case class Record(id: Int, name: String,desg:String,tdate:String)
val myFile1 = myFile.map(x=>x.split(",")).map {
case Array(id, name,desg,tdate) => Record(id.toInt, name,desg,tdate)
}
myFile1.toDF()
This gives me a DataFrame with id as Int and the rest of the columns as String.
I want the last column, tdate, to be casted to date type.
How can I do that?
You just need to convert the String to a java.sql.Date object. Then, your code can simply become:
import java.sql.Date
case class Record(id: Int, name: String,desg:String,tdate:Date)
val myFile1 = myFile.map(x=>x.split(",")).map {
case Array(id, name,desg,tdate) => Record(id.toInt, name,desg,Date.valueOf(tdate))
}
myFile1.toDF()
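Alternatively, a sketch that stays in the DataFrame API and uses Spark's built-in to_date (assuming Spark 2.2+ and that the file is read with the CSV reader instead of sc.textFile; the reader options are assumptions based on the sample file):

import org.apache.spark.sql.functions.{col, to_date}

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("file.txt")
  .withColumn("tdate", to_date(col("tdate"), "yyyy-MM-dd"))
df.printSchema() // tdate is now DateType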