Dataframe state before save and after load - what's different? - scala

I have a DF that contains some SQL expressions (coalesce, case/when etc.).
I later try to map/flatMap this DF and get a Task not serializable error, due to the fields that contain the SQL expressions.
(Why I need to map/flatMap this DF is a separate question)
When I save this DF to a Parquet file and load it afterwards, the error is gone and I can convert to RDD and do transformations no problem!
How is the DF different before saving and after loading? In some way, the SQL expressions must have been evaluated and made persistent. How can I achieve the same thing without saving/loading? (df.persist() did not do the trick.)
Here's test code:
val data = Seq(
  (1, "sku1", "EUR", 99.0, 89.0),
  (2, "sku2", "USD", 89.0, 79.0),
  (3, "sku3", "USD", 49.0, 39.9)
)
val aditionalStuffForCertainSKUsMap = Map("sku1" -> List(10, 20, 30))
val listedPrice = coalesce(
  List("EUR", "USD").map(c => when($"currency" === c, col(c)).otherwise(lit(null))): _*)

val df = (sc.parallelize(data)
  .toDF("id", "sku", "currency", "EUR", "USD")
  .withColumn("price_in_given_currency", when($"currency" === "EUR", $"EUR" * 2).otherwise(1))
  // .withColumn("fails_price_in_given_currency", listedPrice)
)

df.show
df.write.mode(SaveMode.Overwrite).parquet("test_df")
The data contains a sku, and some SKUs represent bundles, like sku1, for which I want to add some other fields to the DF. Only when I try to access this Map[String, List[Int]] within the map() do I get complaints about the fails_price_in_given_currency column, not about price_in_given_currency:
// If I load the df first, the map() works even when using `fails_price_in_given_currency`
//val df = sqlContext.read.parquet("test_df")
val out = df.map(d => {
val key = d.getAs[String]("sku")
aditionalStuffForCertainSKUsMap.getOrElse(key, None)
})
The error is thrown when I use fails_price_in_given_currency instead. If, however, I load df before the map, it runs fine again!

Collecting unique elements during Spark aggregation

Problem
I need to update the following line in my code so that it only appends strings that are not already in the existing field. How do I do that?
case StringType => concat_ws(",", collect_list(col(c)))
In this example, the letter "b" would not appear twice.
Code
val df = Seq(
  (1, 1.0, true, "a"),
  (2, 2.0, false, "b"),
  (3, 2.0, false, "b"),
  (3, 2.0, false, "c")
).toDF("id", "d", "b", "s")
val dataTypes: Map[String, DataType] = df.schema.map(sf => (sf.name, sf.dataType)).toMap

def genericAgg(c: String) = {
  dataTypes(c) match {
    case DoubleType  => sum(col(c))
    case StringType  => concat_ws(",", collect_list(col(c)))
    case BooleanType => max(col(c))
  }
}

val aggExprs: Seq[Column] = df.columns.filterNot(_ == "id").map(c => genericAgg(c))

df
  .groupBy("id")
  .agg(aggExprs.head, aggExprs.tail: _*)
  .show()
You probably want to use collect_set() instead of collect_list(). This will automatically remove duplicates during the collection.
I am not sure why you want to turn the array of unique strings into a comma-delimited list. Spark can easily handle array columns and they are displayed such that each element can be seen. Still, if you absolutely must have the array converted into a comma-delimited string, use array_join in Spark 2.4+ or a UDF in earlier versions of Spark.
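For illustration, here is a sketch of how genericAgg could look with collect_set, and with array_join in case the comma-delimited string is really required (this assumes Spark 2.4+ for array_join; it is not code from the question):
def genericAgg(c: String) = {
  dataTypes(c) match {
    case DoubleType  => sum(col(c))
    // collect_set drops duplicates; array_join turns the resulting array into "a,b,c"
    case StringType  => array_join(collect_set(col(c)), ",")
    case BooleanType => max(col(c))
  }
}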

Spark Structured Streaming MemoryStream + Row + Encoders issue

I am trying to run some tests on my local machine with spark structured streaming.
In batch mode, here are the Rows I am dealing with:
val recordSchema = StructType(List(StructField("Record", MapType(StringType, StringType), false)))

val rows = List(
  Row(Map(
    "ID" -> "1",
    "STRUCTUREID" -> "MFCD00869853",
    "MOLFILE" -> "The MOL Data",
    "MOLWEIGHT" -> "803.482",
    "FORMULA" -> "C44H69NO12",
    "NAME" -> "Tacrolimus",
    "HASH" -> "52b966c551cfe0fa7d526bac16abcb7be8b8867d",
    "SMILES" -> """[H][C#]12O[C#](O)([C#H](C)C[C##H]1OC)""",
    "METABOLISM" -> "The metabolism 500"
  )),
  Row(Map(
    "ID" -> "2",
    "STRUCTUREID" -> "MFCD00869854",
    "MOLFILE" -> "The MOL Data",
    "MOLWEIGHT" -> "603.482",
    "FORMULA" -> "",
    "NAME" -> "Tacrolimus2",
    "HASH" -> "52b966c551cfe0fa7d526bac16abcb7be8b8867d",
    "SMILES" -> """[H][C#]12O[C#](O)([C#H](C)C[C##H]1OC)""",
    "METABOLISM" -> "The metabolism 500"
  ))
)

val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), recordSchema)
Working with that in batch mode works like a charm, no issues.
Now I'm trying to move to streaming mode using MemoryStream for testing. I added the following:
implicit val ctx = spark.sqlContext
val intsInput = MemoryStream[Row]
But the compiler complains as follows:
No implicits found for parameter evidence$1: Encoder[Row]
Hence, my question: what should I do here to get that working?
Also, I saw that if I add the following import, the error goes away:
import spark.implicits._
Actually, I now get the following warning instead of an error
Ambiguous implicits for parameter evidence$1: Encoder[Row]
I do not understand the encoder mechanism well and would appreciate it if someone could explain how to avoid those implicits. The reason is that I read the following in a book about creating a DataFrame from Rows.
Recommended approach:
val myManualSchema = new StructType(Array(
new StructField("some", StringType, true),
new StructField("col", StringType, true),
new StructField("names", LongType, false)))
val myRows = Seq(Row("Hello", null, 1L))
val myRDD = spark.sparkContext.parallelize(myRows)
val myDf = spark.createDataFrame(myRDD, myManualSchema)
myDf.show()
And then the author goes on with this:
In Scala, we can also take advantage of Spark’s implicits in the
console (and if you import them in your JAR code) by running toDF on a
Seq type. This does not play well with null types, so it’s not
necessarily recommended for production use cases.
val myDF = Seq(("Hello", 2, 1L)).toDF("col1", "col2", "col3")
If someone could take the time to explain what is happening in my scenario when I use the implicits, whether it is safe to do so, and whether there is a way to do this more explicitly without importing the implicits, that would be great.
Finally, if someone could point me to a good doc around Encoder and Spark Type mapping that would be great.
EDIT1
I finally got it to work with
implicit val ctx = spark.sqlContext
import spark.implicits._
val rows = MemoryStream[Map[String,String]]
val df = rows.toDF()
My problem here is that I am not confident about what I am doing. It seems that in some situations I need to create a Dataset in order to convert it into a DataFrame of Rows with the toDF conversion. I understood that working with a Dataset is type-safe but slower than working with a DataFrame, so why this intermediate Dataset? This is not the first time I have seen this in Spark Structured Streaming. Again, if someone could help me with this, that would be great.
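(For reference, a minimal sketch of a more explicit route, under the assumption that you are on Spark 2.x, where RowEncoder lives in org.apache.spark.sql.catalyst.encoders: build the Encoder[Row] from the schema yourself instead of relying on the imported implicits.)
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.execution.streaming.MemoryStream

implicit val ctx = spark.sqlContext
// explicit Encoder[Row] derived from the Map-based schema above
implicit val rowEncoder = RowEncoder(recordSchema)

val rowsInput = MemoryStream[Row]
rowsInput.addData(rows)
val streamingDf = rowsInput.toDF()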
I encourage you to use Scala's case classes for data modeling.
final case class Product(name: String, catalogNumber: String, cas: String, formula: String, weight: Double, mld: String)
Now you can have a List of Product in memory:
val inMemoryRecords: List[Product] = List(
Product("Cyclohexanecarboxylic acid", " D19706", "1148027-03-5", "C(11)H(13)Cl(2)NO(5)", 310.131, "MFCD11226417"),
Product("Tacrolimus", "G51159", "104987-11-3", "C(44)H(69)NO(12)", 804.018, "MFCD00869853"),
Product("Methanol", "T57494", "173310-45-7", "C(8)H(8)Cl(2)O", 191.055, "MFCD27756662")
)
The structured streaming API makes it easy to reason about stream processing by using the widely known Dataset[T] abstraction. Roughly speaking, you just have to worry about three things:
Source: a source can generate an input data stream which we can represent as a Dataset[Input]. Every new data item Input that arrives is going to be appended into this unbounded dataset. You can manipulate the data as you wish (e.g. Dataset[Input] => Dataset[Output]).
StreamingQueries and Sink: a query generates a result table that's updated from the Source every trigger interval. Changes are written into external storage called a Sink.
Output modes: there are different modes on which you can write data into the Sink: complete mode, append mode, and update mode.
Let's assume that you want to find the products with a molecular weight greater than 200 units.
As you said, using the batch API is fairly simple and straightforward:
// Create a static dataset using the in-memory data
val staticData: Dataset[Product] = spark.createDataset(inMemoryRecords)
// Processing...
val result: Dataset[Product] = staticData.filter(_.weight > 200)
// Print results!
result.show()
When using the Streaming API you just need to define a source and a sink as an extra step. In this example, we can use the MemoryStream and the console sink to print out the results.
// Create a streaming dataset using the in-memory data (memory source)
val productSource = MemoryStream[Product]
productSource.addData(inMemoryRecords)
val streamingData: Dataset[Product] = productSource.toDS()
// Processing...
val result: Dataset[Product] = streamingData.filter(_.weight > 200)
// Print results by using the console sink.
val query: StreamingQuery = result.writeStream.format("console").start()
// Stop streaming
query.awaitTermination(timeoutMs=5000)
query.stop()
Note that the staticData and the streamingData have the exact type signature (i.e., Dataset[Product]). This allows us to apply the same processing steps regardless of using the Batch or Streaming API. You can also think of implementing a generic method def processing[In, Out](inputData: Dataset[In]): Dataset[Out] = ??? to avoid repeating yourself in both approaches.
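As a small illustration of that idea (a sketch of my own, not part of the original answer), the shared step could be parameterized over the element type and the predicate:
// works unchanged for both the static and the streaming Dataset[Product]
def processing[T](inputData: Dataset[T])(predicate: T => Boolean): Dataset[T] =
  inputData.filter(predicate)

val batchResult  = processing(staticData)(_.weight > 200)
val streamResult = processing(streamingData)(_.weight > 200)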
Complete code example:
import org.apache.spark.sql.{Dataset, SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.streaming.StreamingQuery

object ExMemoryStream extends App {

  // Boilerplate code...
  val spark: SparkSession = SparkSession.builder
    .appName("ExMemoryStreaming")
    .master("local[*]")
    .getOrCreate()

  spark.sparkContext.setLogLevel("ERROR")

  import spark.implicits._
  implicit val sqlContext: SQLContext = spark.sqlContext

  // Define your data models
  final case class Product(name: String, catalogNumber: String, cas: String, formula: String, weight: Double, mld: String)

  // Create some in-memory instances
  val inMemoryRecords: List[Product] = List(
    Product("Cyclohexanecarboxylic acid", " D19706", "1148027-03-5", "C(11)H(13)Cl(2)NO(5)", 310.131, "MFCD11226417"),
    Product("Tacrolimus", "G51159", "104987-11-3", "C(44)H(69)NO(12)", 804.018, "MFCD00869853"),
    Product("Methanol", "T57494", "173310-45-7", "C(8)H(8)Cl(2)O", 191.055, "MFCD27756662")
  )

  // Define the processing step
  def processing(inputData: Dataset[Product]): Dataset[Product] =
    inputData.filter(_.weight > 200)

  // STATIC DATASET
  val datasetStatic: Dataset[Product] = spark.createDataset(inMemoryRecords)
  println("This is the static dataset:")
  processing(datasetStatic).show()

  // STREAMING DATASET
  val productSource = MemoryStream[Product]
  productSource.addData(inMemoryRecords)

  val datasetStreaming: Dataset[Product] = productSource.toDS()
  println("This is the streaming dataset:")
  val query: StreamingQuery = processing(datasetStreaming).writeStream.format("console").start()
  query.awaitTermination(timeoutMs = 5000)

  // Stop the query and close Spark
  query.stop()
  spark.close()
}

How can I map a function on dataFrame column values which returns a dataFrame?

I have a Spark DataFrame, df1, which contains several columns, one of which holds patient IDs. I want to take this column and apply a function that sends an HTTP request for information about each ID, say medical tests. This information is then parsed from JSON and returned by the function as a DataFrame of multiple tests. I want to do this for all the IDs so that I end up with a second DataFrame, df2, containing all the medical test information for the IDs in df1.
I tried the following code, which I think is not optimal, especially for a large number of patients. My problem is that I cannot handle the results in the form of Array[org.apache.spark.sql.DataFrame]. Note this is sample code; in real life I might have 100 tests for one ID and only 3 for another.
import scala.util.Random._
val df1 = Seq(
("9031x", 32),
("1102z", 12),
("3048o", 54)
).toDF("ID", "age")
// a function that takes the string and returns a DataFrame
def getPatientInfo(ID: String): org.apache.spark.sql.DataFrame = {
  val r = scala.util.Random
  val df2 = Seq(
    ("test1", r.nextInt(100), r.nextInt(40) + 1980, r.nextString(4)),
    ("test2", r.nextInt(100), r.nextInt(40) + 1980, r.nextString(3)),
    ("test3", r.nextInt(100), r.nextInt(40) + 1980, r.nextString(5))
  ).toDF("testName", "year", "result", "Notes")
  df2
}
// convert the ID to Array[String]
val ID = df1.collect().map(row => row.getString(0))
// apply the function for each ID
val medicalRecords = for (i <- ID) yield {getPatientInfo(i)}
Are there any other optimal approaches?
TL;DR
It is not possible: functions used in DataFrame.map (or equivalent methods) cannot use the SparkSession or distributed data structures.
If you want to make it work, use your favorite JSON parser instead and redefine getPatientInfo as either:
def getPatientInfo(ID: String): Seq[Row]
or
def getPatientInfo(ID: String): T
where T is a case class, and then replace the for-comprehension with:
df1.flatMap(row => getPatientInfo(row.getString(0)))
(adding an Encoder if necessary).
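A minimal sketch of the case-class variant (the MedicalTest class and its fields are illustrative assumptions, and the HTTP call is stubbed out with random data):
case class MedicalTest(id: String, testName: String, year: Int, result: Int, notes: String)

// stub: in reality this would issue the HTTP request and parse the JSON response
def getPatientInfo(id: String): Seq[MedicalTest] = {
  val r = scala.util.Random
  Seq.tabulate(r.nextInt(5) + 1)(i => MedicalTest(id, s"test$i", 1980 + r.nextInt(40), r.nextInt(100), "n/a"))
}

import spark.implicits._
val df2 = df1.select("ID").as[String].flatMap(getPatientInfo _).toDF()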

Spark, Scala - column type determine

I load data from a database and do some processing on this data.
The problem is that some tables have the date column as 'String', while others treat it as 'timestamp'.
I cannot know which type the date column is until the data is loaded.
x.getAs[String]("date")    // could fail when the date column is of timestamp type
x.getAs[Timestamp]("date") // could fail when the date column is of string type
This is how I load data from spark.
spark.read
.format("jdbc")
.option("url", url)
.option("dbtable", table)
.option("user", user)
.option("password", password)
.load()
Is there any way to treat them uniformly, or to always convert the column to a string?
You can pattern-match on the type of the column (using the DataFrame's schema) to decide whether to parse the String into a Timestamp or just use the Timestamp as is - and use the unix_timestamp function to do the actual conversion:
import java.sql.Timestamp
import java.text.SimpleDateFormat
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

// preparing some example data - df1 with String type and df2 with Timestamp type
val df1 = Seq(("a", "2016-02-01"), ("b", "2016-02-02")).toDF("key", "date")
val df2 = Seq(
  ("a", new Timestamp(new SimpleDateFormat("yyyy-MM-dd").parse("2016-02-01").getTime)),
  ("b", new Timestamp(new SimpleDateFormat("yyyy-MM-dd").parse("2016-02-02").getTime))
).toDF("key", "date")

// If the column is a String, convert it to a Timestamp
def normalizeDate(df: DataFrame): DataFrame = {
  df.schema("date").dataType match {
    case StringType => df.withColumn("date", unix_timestamp($"date", "yyyy-MM-dd").cast("timestamp"))
    case _ => df
  }
}

// after "normalizing", you can assume date has Timestamp type -
// both would print the same thing:
normalizeDate(df1).rdd.map(r => r.getAs[Timestamp]("date")).foreach(println)
normalizeDate(df2).rdd.map(r => r.getAs[Timestamp]("date")).foreach(println)
Here are a few things you can try:
(1) Start using the inferSchema option during load if you have a version that supports it. This has Spark figure out the data types of the columns, although it doesn't work in all scenarios. Also look at the input data; if you have quotes, I advise adding an extra option to account for them during the load.
val inputDF = spark.read.format("csv").option("header","true").option("inferSchema","true").load(fileLocation)
(2) To identify the data type of a column you can use the code below; it will place all of the column names and data types into their own Arrays of Strings.
val columnNames : Array[String] = inputDF.columns
val columnDataTypes : Array[String] = inputDF.schema.fields.map(x=>x.dataType).map(x=>x.toString)
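For example (a small illustrative follow-up, not part of the original answer), the two arrays can be zipped into a lookup table:
val typesByName: Map[String, String] = columnNames.zip(columnDataTypes).toMap
// e.g. typesByName("date") might be "StringType" or "TimestampType"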
Row also has an easy way to address this: get(i: Int): Any. It maps between Spark SQL types and return types automatically, e.g.:
val fieldIndex = row.fieldIndex("date")
val date = row.get(fieldIndex)
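If you then need one consistent type (a String here is just an assumed target), one option is to pattern-match on the returned value:
val dateAsString = row.get(fieldIndex) match {
  case t: java.sql.Timestamp => t.toString  // timestamp column
  case s: String             => s           // string column
  case other                 => String.valueOf(other)
}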
The same pattern-matching approach also works when a column may be either a plain string or a struct, for example a location column:
def parseLocationColumn(df: DataFrame): DataFrame = {
  df.schema("location").dataType match {
    case StringType => df.withColumn("locationTemp", $"location")
      .withColumn("countryTemp", lit("Unknown"))
      .withColumn("regionTemp", lit("Unknown"))
      .withColumn("zoneTemp", lit("Unknown"))
    case _ => df.withColumn("locationTemp", $"location.location")
      .withColumn("countryTemp", $"location.country")
      .withColumn("regionTemp", $"location.region")
      .withColumn("zoneTemp", $"location.zone")
  }
}

Spark: Sort records in groups?

I have a set of records which I need to:
1) Group by 'date', 'city' and 'kind'
2) Sort every group by 'prize'
In my code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object Sort {

  case class Record(name: String, day: String, kind: String, city: String, prize: Int)

  val recs = Array(
    Record("n1", "d1", "k1", "c1", 10),
    Record("n1", "d1", "k1", "c1", 9),
    Record("n1", "d1", "k1", "c1", 8),
    Record("n2", "d2", "k2", "c2", 1),
    Record("n2", "d2", "k2", "c2", 2),
    Record("n2", "d2", "k2", "c2", 3)
  )

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Test")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)
    val rs = sc.parallelize(recs)
    val rsGrp = rs.groupBy(r => (r.day, r.kind, r.city)).map(_._2)
    val x = rsGrp.map { r =>
      val lst = r.toList
      lst.map { e => (e.prize, e) }
    }
    x.sortByKey()
  }
}
When I try to sort group I get an error:
value sortByKey is not a member of org.apache.spark.rdd.RDD[List[(Int,
Sort.Record)]]
What is wrong? How to sort?
You need to define a key and then use mapValues to sort the groups.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._

object Sort {

  case class Record(name: String, day: String, kind: String, city: String, prize: Int)

  // Define your data (the recs array from the question)
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Test")
      .setMaster("local")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)
    val rs = sc.parallelize(recs)

    // Generate the pair RDD necessary to call groupByKey, and group it
    val key: RDD[((String, String, String), Iterable[Record])] = rs.keyBy(r => (r.day, r.city, r.kind)).groupByKey

    // Once grouped, sort the values of each key
    val values: RDD[((String, String, String), List[Record])] = key.mapValues(iter => iter.toList.sortBy(_.prize))

    // Print the result
    values.collect.foreach(println)
  }
}
groupByKey is expensive; it has two implications:
The majority of the data gets shuffled into the remaining N-1 partitions, on average.
All of the records for the same key are loaded into memory on a single executor, potentially causing memory errors.
Depending on your use case, you have different and better options:
If you don't care about the ordering, use reduceByKey or aggregateByKey.
If you just want to group and sort without any transformation, prefer repartitionAndSortWithinPartitions (Spark 1.3.0+, http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.OrderedRDDFunctions), but be very careful which partitioner you specify and test it, because you are now relying on side effects that may change behaviour in a different environment (see the sketch after this list). See also the examples in this repository: https://github.com/sryza/aas/blob/master/ch08-geotime/src/main/scala/com/cloudera/datascience/geotime/RunGeoTime.scala.
If you are applying a transformation or a non-reducible aggregation (fold or scan) to the iterable of sorted records, then check out this library: spark-sorted, https://github.com/tresata/spark-sorted. It provides three APIs for paired RDDs: mapStreamByKey, foldLeftByKey and scanLeftByKey.
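A rough sketch of the repartitionAndSortWithinPartitions route for this Record example (an illustration under the assumption that partitioning by (day, city, kind) and sorting by a composite key is acceptable; it is not the original poster's code):
import org.apache.spark.{HashPartitioner, Partitioner}

// Composite key: the grouping fields plus the sort field (prize)
val keyed = rs.map(r => ((r.day, r.city, r.kind, r.prize), r))

// Partition only on the grouping fields so every group lands in one partition,
// while the implicit tuple ordering sorts by (day, city, kind, prize) within it
class GroupPartitioner(partitions: Int) extends Partitioner {
  private val hash = new HashPartitioner(partitions)
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case (day: String, city: String, kind: String, _) => hash.getPartition((day, city, kind))
  }
}

val sortedWithinGroups = keyed.repartitionAndSortWithinPartitions(new GroupPartitioner(4))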
Replace map with flatMap
val x = rsGrp.flatMap { r =>
  val lst = r.toList
  lst.map { e => (e.prize, e) }
}
this will give you an
org.apache.spark.rdd.RDD[(Int, Record)] = FlatMappedRDD[10]
and then you can call sortBy(_._1) on the RDD above.
As an alternative to @gasparms's solution, I think you can try a filter followed by an rdd.sortBy operation. You filter each record that meets the key criteria. The prerequisite is that you need to keep track of all your keys (filter combinations). You can also build them up as you traverse the records.