Creating a DataFrame containing instances of a case class with integer attributes - scala

Spark fails to convert a sequence of such instances to a DataFrame/Dataset, which is not a great developer experience.
Is this a bug or expected behaviour?
Consider an example:
import spark.implicits._
case class Test(`9`: Double)
This fails:
val failingTest = Seq(Test(9.0)).toDF()
This works fine:
val successfulTest = Seq(9.0).toDF("9").as[Test]
successfulTest.show()

If you look at the rules for declaring a variable and how you are allowed to name it, the way you are specifying this field name is not allowed.
A variable name must start with a letter; it cannot begin with a number or another symbol.
Just change your case class to use an ordinary field name, without the backticked digit, and it should work:
import spark.implicits._
case class Test(Id: Double)
val failingTest = Seq(Test(9.0)).toDS()
The above line of code gives you a Dataset of Test, as below:
failingTest: org.apache.spark.sql.Dataset[Test] = [Id: double]
val failingTest1 = Seq(Test(9.0)).toDF()
The above line gives you a Dataset of Row, i.e. a DataFrame, as below:
failingTest1: org.apache.spark.sql.DataFrame = [Id: double]
It is giving you an error because you are not providing the field name in a proper format.
You could also consider it a bug that no error is raised when you declare the case class itself, which I think should happen.
Your code would also work if you just replace the 9 with any letter, as below:
import spark.implicits._
case class Test1(`i`: Double)
val failingTest = Seq(Test1(9.0)).toDF()
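As a side note, if you really need the resulting column to be literally named 9, one workaround (a sketch, not from the original answer; Test2 is a hypothetical name) is to keep a conventional field name in the case class and rename the single column afterwards:
import spark.implicits._
case class Test2(value: Double)       // conventional field name
val df = Seq(Test2(9.0)).toDF("9")    // rename the single column to "9"
df.show()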

How to infer StructType schema for Spark Scala at run time given a Fully Qualified Name of a case class

For a few days I have been wondering whether it is possible to infer a schema for Spark in Scala for a given case class that is unknown at compile time.
The only input is a string containing the FQN of the class (which could be used, for example, to create an instance of the case class at runtime via reflection).
I was wondering whether it is possible to do something like:
package com.my.namespace
case class MyCaseClass (name: String, num: Int)
//Somewhere else in codebase
// coming from external configuration file, so unknown at compile time
val fqn = "com.my.namespace.MyCaseClass"
val schema = Encoders.product[getXYZ(fqn)].schema
Of course, any other technique that does not use Encoders is fine (building a StructType by analysing an instance of the case class? Is that even possible?).
What is the best approach?
Is it feasible at all?
You can use the reflective toolbox:
package com.my.namespace
import org.apache.spark.sql.types.StructType
import scala.reflect.runtime
import scala.tools.reflect.ToolBox
case class MyCaseClass (name: String, num: Int)
object Main extends App {
  val fqn = "com.my.namespace.MyCaseClass"
  val runtimeMirror = runtime.currentMirror
  val toolbox = runtimeMirror.mkToolBox()
  // Compile and evaluate a small snippet that asks Spark for the Encoder-derived schema of the class named by fqn
  val res = toolbox.eval(toolbox.parse(s"""
    import org.apache.spark.sql.Encoders
    Encoders.product[$fqn].schema
  """)).asInstanceOf[StructType]
  println(res) // StructType(StructField(name,StringType,true),StructField(num,IntegerType,false))
}
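The resulting StructType can then be used wherever a schema is expected, for example when reading untyped data. A minimal sketch, continuing inside Main where res is in scope; the SparkSession setup and the input path are assumptions, not part of the original answer:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.read.schema(res).json("/path/to/data.json")  // hypothetical input path
df.printSchema()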

Gatling: Dynamically assemble HttpCheck for multiple Css selectors

I am working on a Gatling test framework that can be parameterized through external config objects. One use case I have is that there may be zero or more CSS selector checks that need to be saved to variables. In my config object, I've implemented that as a Map[String,(String, String)], where the key is the variable name, and the value is the 2-part css selector.
I am struggling with how to dynamically assemble the check. Here's what I got so far:
val captureMap: Map[String, (String, String)] = config.capture
httpRequestBuilder.check(
  captureMap.map((mapping) => {
    val varName = mapping._1
    val cssSel = mapping._2
    css(cssSel._1, cssSel._2).saveAs(varName)
  }).toArray: _* // compilation error here
)
The error I'm getting is:
Error:(41, 10) type mismatch;
found : Array[io.gatling.core.check.CheckBuilder[io.gatling.core.check.css.CssCheckType,jodd.lagarto.dom.NodeSelector,String]]
required: Array[_ <: io.gatling.http.check.HttpCheck]
}).toArray: _*
Apparently I need to turn my CheckBuilder into an HttpCheck, so how do I do that?
Update:
I managed to get it to work by introducing a variable of type HttpCheck and returning it in the next line:
httpRequestBuilder.check(
  captureMap.map((mapping) => {
    val varName = mapping._1
    val cssSel = mapping._2
    val check: HttpCheck = css(cssSel._1, cssSel._2).saveAs(varName)
    check
  }).toArray: _*
)
While this works, it's ugly as hell. Can this be improved?
I had the same issue.
I had the following imports:
import io.gatling.core.Predef._
import io.gatling.http.Predef.http
I changed these imports to:
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import io.gatling.http.request.builder.HttpRequestBuilder.toActionBuilder
which made it work.
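With those imports in scope the implicit conversion from CheckBuilder to HttpCheck is available, so the intermediate val from the update can also be replaced by a type ascription, which triggers the same conversion. A sketch, not from the original answers and not tied to a specific Gatling version:
import io.gatling.http.check.HttpCheck

httpRequestBuilder.check(
  captureMap.map { case (varName, (selector, attribute)) =>
    css(selector, attribute).saveAs(varName): HttpCheck  // the ascription applies the implicit conversion
  }.toSeq: _*
)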

Spark unable to find encoder(case class) although providing it

Trying to figure out why I am getting an error about encoders; any insight would be helpful!
ERROR Unable to find encoder for type SolrNewsDocument, An implicit Encoder[SolrNewsDocument] is needed to store `
Clearly I have imported spark.implicits._. I have also provided an encoder in the form of a case class.
def ingestDocsToSolr(newsItemDF: DataFrame) = {
  case class SolrNewsDocument(
    title: String,
    body: String,
    publication: String,
    date: String,
    byline: String,
    length: String
  )
  import spark.implicits._
  val solrDocs = newsItemDF.as[SolrNewsDocument].map { doc =>
    val solrDoc = new SolrInputDocument
    solrDoc.setField("title", doc.title.toString)
    solrDoc.setField("body", doc.body)
    solrDoc.setField("publication", doc.publication)
    solrDoc.setField("date", doc.date)
    solrDoc.setField("byline", doc.byline)
    solrDoc.setField("length", doc.length)
    solrDoc
  }
  // can be used for stream SolrSupport.
  SolrSupport.indexDocs("localhost:2181", "collection", 10, solrDocs.rdd)
  val solrServer = SolrSupport.getCachedCloudClient("localhost:2181")
  solrServer.setDefaultCollection("collection")
  solrServer.commit(false, false)
}
Check this one: move the case class declaration before (outside of) the function declaration.
The encoder is created once the case class definition has been compiled at the top level; only then can the compiler use that encoder inside the function definition.
import spark.implicits._

case class SolrNewsDocument(title: String, body: String, publication: String, date: String, byline: String, length: String)

def ingestDocsToSolr(newsItemDF: DataFrame) = {
  val solrDocs = newsItemDF.as[SolrNewsDocument]
  // ... rest of the Solr indexing logic from the question
}
I got this error while trying to iterate over a text file. In my case, as of Spark 2.4.x, the issue was that I had to convert it to an RDD first (a conversion that used to be implicit):
textFile
  .rdd
  .flatMap(line => line.split(" "))
We ran into this while migrating our Scala codebase to Spark 2.
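If the missing encoder is for a simple element type such as String, another option worth trying is to keep it as a Dataset and bring the built-in encoders into scope. A sketch, assuming textFile is a Dataset[String] such as the result of spark.read.textFile:
import spark.implicits._  // provides Encoder[String] for flatMap on a Dataset
val words = textFile.flatMap(line => line.split(" "))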

Using a case class to rename split columns with Spark Dataframe

I am splitting 'split_column' into another five columns as per the following code. However, I want these new columns to be renamed so that they have meaningful names (say "new_renamed1", "new_renamed2", "new_renamed3", "new_renamed4", "new_renamed5" in this example).
val df1 = df
  .withColumn("new_column", split(col("split_column"), "\\|"))
  .select(col("*") +: (0 until 5).map(i => col("new_column").getItem(i).as(s"newcol$i")): _*)
  .drop("split_column", "new_column")
val new_columns_renamed = Seq(/* ...renamed names for the existing columns..., */ "new_renamed1", "new_renamed2", "new_renamed3", "new_renamed4", "new_renamed5")
val df2 = df1.toDF(new_columns_renamed: _*)
However, the issue with this approach is that some of my splits might produce more than fifty new columns. With this renaming approach, a small typo (an extra comma, a missing double quote) would be painful to detect.
Is there a way to rename the columns with a case class, like the one below?
case class SplittedRecord (new_renamed1: String, new_renamed2: String, new_renamed3: String, new_renamed4: String, new_renamed5: String)
Please note that in the actual scenario the names would not look like new_renamed1, new_renamed2, ..., new_renamed5; they would be totally different.
You could try something like this:
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.{col, split}
import spark.implicits._ // for the $"..." column syntax

val names = Encoders.product[SplittedRecord].schema.fieldNames

names.zipWithIndex
  .foldLeft(df.withColumn("new_column", split(col("split_column"), "\\|"))) {
    case (df, (c, i)) => df.withColumn(c, $"new_column"(i))
  }
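The helper column (and split_column itself, if it is no longer needed) can then be dropped from the result. A sketch repeating the fold above:
val result = names.zipWithIndex
  .foldLeft(df.withColumn("new_column", split(col("split_column"), "\\|"))) {
    case (acc, (c, i)) => acc.withColumn(c, $"new_column"(i))
  }
  .drop("split_column", "new_column")  // remove the helper columns from the final result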
One of the ways to use the case class
case class SplittedRecord (new_renamed1: String, new_renamed2: String, new_renamed3: String, new_renamed4: String, new_renamed5: String)
is through a udf function, as below:
import org.apache.spark.sql.functions._

def splitUdf = udf((array: Seq[String]) => SplittedRecord(array(0), array(1), array(2), array(3), array(4)))

df.withColumn("test", splitUdf(split(col("split_column"), "\\|")))
  .drop("split_column")
  .select(col("*"), col("test.*"))
  .drop("test")
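If a typed Dataset is acceptable, the struct produced by the udf can also be expanded and mapped straight onto the case class. A sketch, assuming spark.implicits._ is in scope:
import spark.implicits._

val ds = df
  .withColumn("test", splitUdf(split(col("split_column"), "\\|")))
  .select(col("test.*"))  // expands to new_renamed1 ... new_renamed5
  .as[SplittedRecord]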

How to handle dates in Spark using Scala?

I have a flat file that looks like the one shown below.
id,name,desg,tdate
1,Alex,Business Manager,2016-01-01
I am using the Spark Context to read this file as follows.
val myFile = sc.textFile("file.txt")
I want to generate a Spark DataFrame from this file and I am using the following code to do so.
case class Record(id: Int, name: String, desg: String, tdate: String)

val myFile1 = myFile.map(x => x.split(",")).map {
  case Array(id, name, desg, tdate) => Record(id.toInt, name, desg, tdate)
}

myFile1.toDF()
This gives me a DataFrame with id as Int and the rest of the columns as String.
I want the last column, tdate, to be cast to a date type.
How can I do that?
You just need to convert the String to a java.sql.Date object. Then, your code can simply become:
import java.sql.Date

case class Record(id: Int, name: String, desg: String, tdate: Date)

val myFile1 = myFile.map(x => x.split(",")).map {
  case Array(id, name, desg, tdate) => Record(id.toInt, name, desg, Date.valueOf(tdate))
}

myFile1.toDF()
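Alternatively, if you keep tdate as a String in the case class, the column can be cast after the DataFrame is built using the built-in to_date function. A sketch, assuming spark.implicits._ is in scope; also note that if the header line really is present in the file, it would have to be filtered out before the .toInt conversion:
import org.apache.spark.sql.functions.to_date

// myFile1 here is the RDD[Record] from the question, with tdate still a String
val df = myFile1.toDF().withColumn("tdate", to_date($"tdate"))
df.printSchema()  // tdate is now of date type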