Unit testing with Spark dataframes - scala

I'm trying to test a part of my program which performs transformations on dataframes.
I want to test several different variations of these dataframes, which rules out the option of reading a specific DF from a file.
And so my questions are:
Is there any good tutorial on how to perform unit testing with Spark and dataframes, especially regarding the dataframes creation?
How can I create these different multi-line dataframes without a lot of boilerplate and without reading them from a file?
Are there any utility classes for checking for specific values inside a dataframe?
I obviously googled this beforehand but could not find anything very useful. Among the more useful links I found were:
Running a basic unit test with a dataframe
Custom made assertions with DF
It would be great if the examples/tutorials were in Scala, but I'll take whatever language you've got.
Thanks in advance.

This link shows how we can programmatically create a DataFrame with a schema. You can keep the data in separate traits and mix them in with your tests. For instance,
// This example assumes CSV data. But same approach should work for other formats as well.
trait TestData {
  val data1 = List(
    "this,is,valid,data",
    "this,is,in-valid,data"
  )
  val data2 = ...
}
Then with ScalaTest, we can do something like this.
class MyDFTest extends FlatSpec with Matchers {
"method" should "perform this" in new TestData {
// You can access your test data here. Use it to create the DataFrame.
// Your test here.
}
}
To create the DataFrame, you can have a few utility methods like the ones below.
def schema(types: Array[String], cols: Array[String]) = {
val datatypes = types.map {
case "String" => StringType
case "Long" => LongType
case "Double" => DoubleType
// Add more types here based on your data.
case _ => StringType
}
StructType(cols.indices.map(x => StructField(cols(x), datatypes(x))).toArray)
}
def df(data: List[String], types: Array[String], cols: Array[String]) = {
  val lines = sc.parallelize(data)
  val parser = new CSVParser(',')
  val split = lines.map(line => parser.parseLine(line))
  // Every field is still a String here; convert values if the schema uses other types.
  val rows = split.map(arr => Row(arr(0), arr(1), arr(2), arr(3)))
  sqlContext.createDataFrame(rows, schema(types, cols))
}
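If Spark 2.x or later is available, a lighter-weight alternative to parsing CSV strings is to build small DataFrames directly from Scala collections. The following is a minimal sketch under that assumption; the object name, column names, and sample values are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical fixture object; names and values are illustrative only.
object InlineTestData {
  // Variant 1: tuples plus toDF; column types are inferred from the tuple element types.
  def fromTuples(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("alice", 1L), ("bob", 2L)).toDF("name", "count")
  }

  // Variant 2: rows plus an explicit schema, for when exact types or nullability matter.
  def fromRows(spark: SparkSession): DataFrame = {
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("count", LongType, nullable = false)
    ))
    val rows = java.util.Arrays.asList(Row("alice", 1L), Row("bob", 2L))
    spark.createDataFrame(rows, schema)
  }
}
```

Each test can then declare exactly the rows it needs in a line or two, which keeps the variations close to the assertions that depend on them.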
I am not aware of any utility classes for checking specific values in a DataFrame. But I think it should be simple to write one using the DataFrame APIs.
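As a rough illustration of what such a helper might look like, here is a minimal sketch that compares two small DataFrames by column names and by row content, ignoring row order. The object and method names are hypothetical, and it collects both DataFrames to the driver, so it is only suitable for test-sized data.

```scala
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical assertion helper, not part of Spark or ScalaTest.
object DataFrameAssertions {
  // Counts how often each row occurs, so duplicate rows are compared correctly.
  private def rowCounts(df: DataFrame): Map[Row, Int] =
    df.collect().toSeq.groupBy(identity).map { case (row, rows) => row -> rows.size }

  // Throws an AssertionError if the column names or the row contents differ.
  def assertSameData(actual: DataFrame, expected: DataFrame): Unit = {
    assert(
      actual.columns.toSeq == expected.columns.toSeq,
      s"Column mismatch: ${actual.columns.mkString(",")} vs ${expected.columns.mkString(",")}"
    )
    val a = rowCounts(actual)
    val e = rowCounts(expected)
    assert(
      a == e,
      s"Row mismatch. Only in actual: ${a.keySet -- e.keySet}. Only in expected: ${e.keySet -- a.keySet}"
    )
  }
}
```

Comparing collected rows rather than calling except() keeps duplicate rows significant, at the cost of pulling all data to the driver.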

You could use SharedSQLContext and SharedSparkSession that Spark uses for its own unit tests. Check my answer for examples.
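Those traits live in Spark's own test sources, so using them means depending on Spark's test artifacts. If that is not an option, a hand-rolled stand-in can be fairly small. The sketch below is an assumption-laden illustration, not Spark's implementation: the trait name and the chosen settings are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for Spark's internal test traits: one lazily created local session.
trait LocalSparkSession {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("unit-tests")
    .config("spark.ui.enabled", "false")          // no web UI needed in tests
    .config("spark.sql.shuffle.partitions", "2")  // keep small jobs fast
    .getOrCreate()
}

// Example of mixing the trait into a test-like object.
object ExampleCheck extends LocalSparkSession {
  def countGreaterThanOne(): Long = {
    import spark.implicits._
    Seq(1, 2, 3).toDF("n").filter($"n" > 1).count()
  }
}
```

Because getOrCreate() reuses an existing session, several suites mixing in the trait within one JVM would share the same session rather than start a new one each time.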

For those looking to achieve something similar in Java, you can start by using this project to initialize a SparkContext within your unit tests: https://github.com/holdenk/spark-testing-base
I personally had to mimic the file structure of some Avro files, so I used avro-tools (https://avro.apache.org/docs/1.8.2/gettingstartedjava.html#download_install) to extract sample records, and with them the schema, from my binary files using the following command:
java -jar $AVRO_HOME/avro tojson largeAvroFile.avro | head -3
Then, using this small helper method, you can convert the output JSON into a DataFrame to use in your unit tests.
private DataFrame getDataFrameFromList() {
    SQLContext sqlContext = new SQLContext(jsc());
    // Each element is one JSON record, as printed by avro-tools.
    ImmutableList<String> elements = ImmutableList.of(
        "{\"header\":{\"appId\":\"myAppId1\",\"clientIp\":\"10.22.63.3\",\"createdDate\":\"2017-05-10T02:09:59.984Z\"}}",
        "{\"header\":{\"appId\":\"myAppId1\",\"clientIp\":\"11.22.63.3\",\"createdDate\":\"2017-05-11T02:09:59.984Z\"}}",
        "{\"header\":{\"appId\":\"myAppId1\",\"clientIp\":\"12.22.63.3\",\"createdDate\":\"2017-05-11T02:09:59.984Z\"}}"
    );
    JavaRDD<String> parallelize = jsc().parallelize(elements);
    return sqlContext.read().json(parallelize);
}

Related

Find columns to select, for spark.read(), from another Dataset - Spark Scala

I have a Dataset[Year] that has the following schema:
case class Year(day: Int, month: Int, Year: Int)
Is there any way to make a collection of the current schema?
I have tried:
println("Print -> "+ds.collect().toList)
But the result was:
Print -> List([01,01,2022], [31,01,2022])
I expected something like:
Print -> List(Year(01,01,2022), Year(31,01,2022))
I know that with a map I can adjust it, but I am trying to create a generic method that accepts any schema, and for this I cannot add a map doing the conversion.
That is my method:
class SchemeList[A]{
def set[A](ds: Dataset[A]): List[A] = {
ds.collect().toList
}
}
The method's return type appears to have the correct signature, but at runtime it throws an error:
val setYears = new SchemeList[Year]
val YearList: List[Year] = setYears.set(df)
Exception in thread "main" java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to schemas.Schemas$Year
Based on your additional information in your comment:
I need this list to use as variables when creating another dataframe via jdbc (I need to make a specific select within postgresql). Is there a more performative way to pass values from a dataframe as parameters in a select?
Given your initial dataset:
val yearsDS: Dataset[Year] = ???
and that you want to do something like:
val desiredColumns: Array[String] = ???
spark.read.jdbc(..).select(desiredColumns.head, desiredColumns.tail: _*)
You could find the column names of yearsDS by doing:
val desiredColumns: Array[String] = yearsDS.columns
Spark achieves this by using def schema, which is defined on Dataset.
You can see the definition of def columns here.
Maybe you have a DataFrame, not a Dataset.
Try using "as" to convert the DataFrame into a Dataset, like this:
val year = Year(1,1,1)
val years = Array(year,year).toList
import spark.implicits._
val df = spark
.sparkContext
.parallelize(years)
.toDF("day","month","Year")
.as[Year]
println(df.collect().toList)
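To keep the method generic while still getting typed objects back, one option is to require an Encoder for the target type and do the "as" conversion inside the method. This is a minimal sketch assuming Spark 2.x or later; the object and method names are hypothetical, and Year mirrors the case class from the question.

```scala
import org.apache.spark.sql.{DataFrame, Encoder}

// Mirrors the case class from the question; it must be defined at top level for encoder derivation.
case class Year(day: Int, month: Int, Year: Int)

// Hypothetical generic helper.
object TypedCollect {
  // The context bound asks the caller for an Encoder[A], typically supplied by spark.implicits._
  def toTypedList[A: Encoder](df: DataFrame): List[A] =
    df.as[A].collect().toList
}
```

With import spark.implicits._ in scope, a call such as TypedCollect.toTypedList[Year](df) should return instances of Year instead of generic rows, provided the DataFrame's column names and types match the case class fields.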

How can I introspect and pre-load all collections from MongoDB into the Spark SQL catalog?

When learning Spark SQL, I've been using the following approach to register a collection into the Spark SQL catalog and query it.
val persons: Seq[MongoPerson] = Seq(MongoPerson("John", "Doe"))
sqlContext.createDataset(persons)
.write
.format("com.mongodb.spark.sql.DefaultSource")
.option("collection", "peeps")
.mode("append")
.save()
sqlContext.read
.format("com.mongodb.spark.sql.DefaultSource")
.option("collection", "peeps")
.load()
.as[Peeps]
.show()
However, when querying it, it seems that I need to register it as a temporary view in order to access it using SparkSQL.
val readConfig = ReadConfig(Map("uri" -> "mongodb://localhost:37017/test", "collection" -> "morepeeps"), Some(ReadConfig(spark)))
val people: DataFrame = MongoSpark.load[Peeps](spark, readConfig)
people.show()
people.createOrReplaceTempView("peeps")
spark.catalog.listDatabases().show()
spark.catalog.listTables().show()
sqlContext.sql("SELECT * FROM peeps")
.as[Peeps]
.show()
For a database with quite a few collections, is there a way to hydrate the Spark SQL schema catalog so that this op isn't so verbose?
There are a couple of things going on. First of all, simply loading the Dataset using sqlContext.read will not register it with the SparkSQL catalog. The function chain in your first code sample ends at .as[Peeps], which just returns a Dataset. You need to tell Spark that you want to use it as a view.
Depending on what you're doing with it, I might recommend leaning on the Scala Dataset API rather than SparkSQL. However, if SparkSQL is absolutely essential, you can likely speed things up programmatically.
In my experience, you'll need to run that boilerplate on each table you want to import. Fortunately, Scala is a proper programming language, so we can cut down on code duplication substantially by using a function, and calling it as such:
val MongoDbUri: String = "mongodb://localhost:37017/test" // store this as a constant somewhere
// T must be passed in as some case class
// Note, you can also add a second parameter to change the view name if so desired
def loadTableAsView[T <: Product : TypeTag](table: String)(implicit spark: SparkSession): Dataset[T] = {
  import spark.implicits._ // provides the Encoder[T] that df.as[T] needs
  val configMap = Map(
    "uri" -> MongoDbUri,
    "collection" -> table
  )
  val readConfig = ReadConfig(configMap, Some(ReadConfig(spark)))
  val df: DataFrame = MongoSpark.load[T](spark, readConfig)
  df.createOrReplaceTempView(table)
  df.as[T]
}
And to call it:
// Note: if spark is defined implicitly, e.g. implicit val spark: SparkSession = spark, you won't need to pass it explicitly
val peepsDS: Dataset[Peeps] = loadTableAsView[Peeps]("peeps")(spark)
val chocolatesDS: Dataset[Chocolates] = loadTableAsView[Chocolates]("chocolates")(spark)
val candiesDS: Dataset[Candies] = loadTableAsView[Candies]("candies")(spark)
spark.catalog.listDatabases().show()
spark.catalog.listTables().show()
peepsDS.show()
chocolatesDS.show()
candiesDS.show()
This will substantially cut down your boilerplate, and also allow you to more easily write some tests for that repeated bit of code. There's also probably a way to create a map of table names to case classes that you can then iterate over, but I don't have an IDE handy to test it out.
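If typed Datasets are not required and only the SQL views matter, the iteration idea can be sketched without case classes by loading each collection as an untyped DataFrame. The following is a sketch under assumptions: the object and method names are hypothetical, the list of collection names is supplied by the caller, and the "uri" and "collection" reader options are assumed to behave like the keys used in the ReadConfig map above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper; not part of the MongoDB Spark connector.
object MongoViews {
  // Loads one collection as a DataFrame through the connector's data source.
  // Assumption: the reader accepts "uri" and "collection" as options.
  def mongoLoader(spark: SparkSession, uri: String)(collection: String): DataFrame =
    spark.read
      .format("com.mongodb.spark.sql.DefaultSource")
      .option("uri", uri)
      .option("collection", collection)
      .load()

  // Registers one temporary view per collection name, using whatever loader is passed in.
  def registerViews(collections: Seq[String])(load: String => DataFrame): Unit =
    collections.foreach(name => load(name).createOrReplaceTempView(name))
}
```

A call might look like MongoViews.registerViews(Seq("peeps", "chocolates", "candies"))(name => MongoViews.mongoLoader(spark, MongoDbUri)(name)). Taking the loader as a parameter also means the registration logic can be exercised in a unit test with in-memory DataFrames instead of a running MongoDB.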

How to pass DataSet(s) to a function that accepts DataFrame(s) as arguments in Apache Spark using Scala?

I have a library in Scala for Spark which contains many functions.
One example is the following function to unite two dataframes that have different columns:
def appendDF(df2: DataFrame): DataFrame = {
val cols1 = df.columns.toSeq
val cols2 = df2.columns.toSeq
def expr(sourceCols: Seq[String], targetCols: Seq[String]): Seq[Column] = {
targetCols.map({
case x if sourceCols.contains(x) => col(x)
case y => lit(null).as(y)
})
}
// both df's need to pass through `expr` to guarantee the same order, as needed for correct unions.
df.select(expr(cols1, cols1): _*).union(df2.select(expr(cols2, cols1): _*))
}
I would like to apply this function (and many more) to Dataset[CleanRow] rather than to DataFrames. CleanRow is a simple class here that defines the names and types of the columns.
My educated guess is to convert the Dataset into a DataFrame using the .toDF() method. However, I would like to know whether there are better ways to do it.
From my understanding, there shouldn't be many differences between Dataset and DataFrame, since a DataFrame is just a Dataset[Row]. Plus, I think that from Spark 2.x the APIs for DF and DS have been unified, so I was thinking that I could pass either of them interchangeably, but that's not the case.
If changing the signature is possible, make the function generic in the element type:
import spark.implicits._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Dataset
def f[T](d: Dataset[T]): Dataset[T] = {d}
// You are able to pass a dataframe:
f(Seq(0,1).toDF())
// res1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
// You are also able to pass a dataset:
f(spark.createDataset(Seq(0,1)))
// res2: org.apache.spark.sql.Dataset[Int] = [value: int]
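Applied to the function from the question, the same idea could look like the sketch below. It is an illustration rather than a drop-in replacement: the object and method names are hypothetical, and both inputs are passed explicitly instead of relying on an enclosing df. Accepting Dataset[_] lets callers pass either a DataFrame or any typed Dataset; the result is an untyped DataFrame because the union of two different element types has no common case class.

```scala
import org.apache.spark.sql.{Column, DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical generic variant of appendDF.
object DatasetOps {
  def appendDatasets(left: Dataset[_], right: Dataset[_]): DataFrame = {
    val targetCols = left.columns.toSeq

    // Projects a dataset onto the target columns, filling missing ones with null.
    def aligned(ds: Dataset[_]): DataFrame = {
      val available = ds.columns.toSet
      val projection: Seq[Column] =
        targetCols.map(c => if (available(c)) col(c) else lit(null).as(c))
      ds.select(projection: _*)
    }

    // Both sides go through the same projection so the column order matches for union.
    aligned(left).union(aligned(right))
  }
}
```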

Using map() and filter() in Spark instead of spark.sql

I have two datasets that I want to INNER JOIN to give me a whole new table with the desired data. I used SQL and managed to get it. But now I want to try it with map() and filter(). Is that possible?
This is my code using Spark SQL:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object hello {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local")
.setAppName("quest9")
val sc = new SparkContext(conf)
val spark = SparkSession.builder().appName("quest9").master("local").getOrCreate()
val zip_codes = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/zip.csv")
val census = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/census.csv")
census.createOrReplaceTempView("census")
zip_codes.createOrReplaceTempView("zip")
//val query = spark.sql("SELECT * FROM census")
val query = spark.sql("SELECT DISTINCT census.Total_Males AS male, census.Total_Females AS female FROM census INNER JOIN zip ON census.Zip_Code=zip.Zip_Code WHERE zip.City = 'Inglewood' AND zip.County = 'Los Angeles'")
query.show()
query.write.parquet("/home/hdfs/Documents/population/census/IDE/census.parquet")
sc.stop()
}
}
The only sensible way, in general, to do this would be to use the join() method of `Dataset`. I would urge you to question the need to use only map/filter for this, as it is not intuitive and will probably confuse any experienced Spark developer (or, simply put, make them roll their eyes). It may also lead to scalability issues should the dataset grow.
That said, in your use case it is pretty simple to avoid using join. Another possibility would be to issue two separate jobs to Spark:
1. Fetch the zip code(s) that interest you.
2. Filter the census data on that (those) zip code(s).
Step 1: collect the zip codes of interest (I am not sure of the exact syntax as I do not have a Spark shell at hand, but it should be trivial to find the right one).
// Requires import spark.implicits._ for the String encoder used by map.
val codes: Seq[String] = zip_codes
  // filter on the city
  .filter(row => row.getAs[String]("City").equals("Inglewood"))
  // filter on the county
  .filter(row => row.getAs[String]("County").equals("Los Angeles"))
  // map to zip code as a String
  .map(row => row.getAs[String]("Zip_Code"))
  // Collect on the driver side
  .collect()
Then again, writing it this way instead of using select/where looks pretty strange to anyone used to Spark.
Yet the reason this works is that we can be sure the set of zip codes matching a given town and county is really small, so it is safe to collect the result on the driver side.
Now on to step 2:
census.filter(row => codes.contains(row.getAs[String]("Zip_Code")))
.map( /* whatever to get your data out */ )
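Put together, the two steps could look like the following sketch, written against typed Datasets so that map and filter operate on case classes instead of Row. The case classes, the object name, and the Int column types are assumptions made for illustration (the CSV files in the question would load every column as a string unless a schema is supplied).

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical row types; field names follow the column names used in the question.
case class Zip(Zip_Code: String, City: String, County: String)
case class Census(Zip_Code: String, Total_Males: Int, Total_Females: Int)

object MapFilterJoin {
  def population(spark: SparkSession, zips: Dataset[Zip], census: Dataset[Census],
                 city: String, county: String): Seq[(Int, Int)] = {
    import spark.implicits._

    // Step 1: collect the (small) set of matching zip codes on the driver.
    val codes: Set[String] = zips
      .filter(z => z.City == city && z.County == county)
      .map(_.Zip_Code)
      .collect()
      .toSet

    // Step 2: keep only census rows for those zip codes and project the two counts.
    census
      .filter(c => codes.contains(c.Zip_Code))
      .map(c => (c.Total_Males, c.Total_Females))
      .distinct()
      .collect()
      .toSeq
  }
}
```

This only stays safe while the set of zip codes is small enough to be captured in the closure; for anything larger, the join-based version is the better fit.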
What you need is a join. Your query roughly translates to:
census.as("census")
.join(
broadcast(zip_codes
.where($"City"==="Inglewood")
.where($"County"==="Los Angeles")
.as("zip"))
,Seq("Zip_Code"),
"inner" // "leftsemi" would also be sufficient
)
.select(
$"census.Total_Males".as("male"),
$"census.Total_Females".as("female")
).distinct()

Not able to apply function to Spark Dataframe Column

I am trying to apply a function to one of my dataframe columns to convert the values. The values in the column look like "20160907", and I need them to be "2016-09-07".
I wrote a function like this:
def convertDate(inDate:String ): String = {
val year = inDate.substring(0,4)
val month = inDate.substring(4,6)
val day = inDate.substring(6,8)
return year+'-'+month+'-'+day
}
And in my spark scala code, I am using this:
def final_Val {
val oneDF = hiveContext.read.orc("/tmp/new_file.txt")
val convertToDate_udf = udf(convertDate _)
val convertedDf = oneDF.withColumn("modifiedDate", convertToDate_udf(col("EXP_DATE")))
convertedDf.show()
}
Surprisingly, in the Spark shell I am able to run this without any error. In Scala IDE I am getting the compilation error below:
Multiple markers at this line:
not enough arguments for method udf: (implicit evidence$2:
reflect.runtime.universe.TypeTag[String], implicit evidence$3: reflect.runtime.universe.TypeTag[String])org.apache.spark.sql.UserDefinedFunction. Unspecified value parameters evidence$2, evidence$3.
I am using Spark 1.6.2, Scala 2.10.5
Can someone please tell me what I am doing wrong here?
I tried the same code with different functions, like in this post: stackoverflow.com/questions/35227568/applying-function-to-spark-dataframe-column
I am not getting any compilation issues with that code, and I am unable to find the issue with mine.
From what I learned in a Spark Summit course, you should use the sql.functions methods as much as possible. Before implementing your own UDF, check whether an existing function in the sql.functions package already does the same work. With the existing functions, Spark can do a lot of optimizations for you, and it does not have to serialize and deserialize your data to and from JVM objects.
To achieve the result you want, I propose this solution:
val oneDF = spark.sparkContext.parallelize(Seq("19931001", "19931001")).toDF("EXP_DATE")
val convertedDF = oneDF.withColumn("modifiedDate", from_unixtime(unix_timestamp($"EXP_DATE", "yyyyMMdd"), "yyyy-MM-dd"))
convertedDF.show()
This gives the following result:
+--------+------------+
|EXP_DATE|modifiedDate|
+--------+------------+
|19931001|  1993-10-01|
|19931001|  1993-10-01|
+--------+------------+
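On Spark 2.2 or later, a variant of the same idea is to go through a date value instead of a Unix timestamp, using to_date with an explicit pattern followed by date_format. This is a minimal sketch assuming that Spark version; the object and method names are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, date_format, to_date}

// Hypothetical helper built only from functions in org.apache.spark.sql.functions.
object DateColumns {
  // Parses a yyyyMMdd string column and adds it back as a yyyy-MM-dd string column.
  def withIsoDate(df: DataFrame, source: String, target: String): DataFrame =
    df.withColumn(target, date_format(to_date(col(source), "yyyyMMdd"), "yyyy-MM-dd"))
}
```

Keeping the conversion in a small function that takes and returns a DataFrame also makes it easy to cover with a unit test on a two-row input.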
Hope this helps. Best regards