Spark: Sort records in groups? - scala

I have a set of records which I need to:
1) Group by 'day', 'city' and 'kind'
2) Sort every group by 'prize'
Here is my code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object Sort {

  case class Record(name: String, day: String, kind: String, city: String, prize: Int)

  val recs = Array(
    Record("n1", "d1", "k1", "c1", 10),
    Record("n1", "d1", "k1", "c1", 9),
    Record("n1", "d1", "k1", "c1", 8),
    Record("n2", "d2", "k2", "c2", 1),
    Record("n2", "d2", "k2", "c2", 2),
    Record("n2", "d2", "k2", "c2", 3)
  )

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Test")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)
    val rs = sc.parallelize(recs)
    val rsGrp = rs.groupBy(r => (r.day, r.kind, r.city)).map(_._2)
    val x = rsGrp.map { r =>
      val lst = r.toList
      lst.map { e => (e.prize, e) }
    }
    x.sortByKey()
  }
}
When I try to sort the groups I get an error:
value sortByKey is not a member of org.apache.spark.rdd.RDD[List[(Int,
Sort.Record)]]
What is wrong? How do I sort?

You need to define a key and then use mapValues to sort the values.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._

object Sort {

  case class Record(name: String, day: String, kind: String, city: String, prize: Int)

  // Define your data
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Test")
      .setMaster("local")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)
    val rs = sc.parallelize(recs)
    // Generate the pair RDD necessary to call groupByKey, and group it
    val key: RDD[((String, String, String), Iterable[Record])] =
      rs.keyBy(r => (r.day, r.city, r.kind)).groupByKey
    // Once grouped, you need to sort the values of each key
    val values: RDD[((String, String, String), List[Record])] =
      key.mapValues(iter => iter.toList.sortBy(_.prize))
    // Print the result
    values.collect.foreach(println)
  }
}
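For reference, running this against the recs array from the question should print something like the following (the order of the two groups may vary, but within each group the records are sorted by prize):
((d1,c1,k1),List(Record(n1,d1,k1,c1,8), Record(n1,d1,k1,c1,9), Record(n1,d1,k1,c1,10)))
((d2,c2,k2),List(Record(n2,d2,k2,c2,1), Record(n2,d2,k2,c2,2), Record(n2,d2,k2,c2,3)))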

groupByKey is expensive; it has two implications:
The majority of the data gets shuffled into the remaining N-1 partitions on average.
All of the records of the same key get loaded into memory on a single executor, potentially causing memory errors.
Depending on your use case, you have different and better options:
If you don't care about the ordering, use reduceByKey or aggregateByKey.
If you just want to group and sort without any transformation, prefer repartitionAndSortWithinPartitions (Spark 1.3.0+, http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.OrderedRDDFunctions), but be very careful about which partitioner you specify and test it, because you are now relying on side effects that may change behaviour in a different environment (see the sketch after this list). See also the examples in this repository: https://github.com/sryza/aas/blob/master/ch08-geotime/src/main/scala/com/cloudera/datascience/geotime/RunGeoTime.scala.
If you are applying a transformation, or a non-reducible aggregation (fold or scan) over the iterable of sorted records, then check out this library: spark-sorted, https://github.com/tresata/spark-sorted. It provides 3 APIs for paired RDDs: mapStreamByKey, foldLeftByKey and scanLeftByKey.
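A minimal sketch of the repartitionAndSortWithinPartitions route, applied to the rs RDD from the question. The GroupPartitioner class and the composite-key layout are illustrative assumptions, not part of the original answer:
import org.apache.spark.Partitioner

// Hypothetical partitioner: route records by the group part of the composite key,
// so all records of one (day, city, kind) group land in the same partition.
class GroupPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (group, _) => ((group.hashCode % numPartitions) + numPartitions) % numPartitions
  }
}

// Composite key (group, prize): tuples of orderable types get an implicit Ordering,
// so within each partition records come out sorted by group and then by prize,
// without ever materialising a whole group in memory.
val keyedByGroupAndPrize = rs.map(r => (((r.day, r.city, r.kind), r.prize), r))
val sortedWithinPartitions =
  keyedByGroupAndPrize.repartitionAndSortWithinPartitions(new GroupPartitioner(4))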

Replace map with flatMap:
val x = rsGrp.flatMap { r =>
  val lst = r.toList
  lst.map { e => (e.prize, e) }
}
This will give you an
org.apache.spark.rdd.RDD[(Int, Record)] = FlatMappedRDD[10]
and then you can call sortBy(_._1) on the RDD above.
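Putting that together, a short usage sketch (variable names taken from the question):
// Flatten every group into (prize, record) pairs, then sort globally by prize
val sortedByPrize = rsGrp
  .flatMap(group => group.map(e => (e.prize, e)))
  .sortBy(_._1)
sortedByPrize.collect().foreach(println)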

As an alternative to @gasparms' solution, I think you can try a filter followed by an rdd.sortBy operation. For each key you filter out the records that meet the key criteria. The prerequisite is that you need to keep track of all your keys (filter combinations); you can also build that set as you traverse the records.
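A rough sketch of that idea, assuming the keys are collected to the driver up front (this runs one filter-and-sort job per key, so it is only reasonable for a small number of key combinations):
// Distinct key combinations, collected to the driver
val keys = rs.map(r => (r.day, r.city, r.kind)).distinct().collect()
// One sorted group per key
val sortedGroups = keys.map { k =>
  k -> rs.filter(r => (r.day, r.city, r.kind) == k).sortBy(_.prize).collect().toList
}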

Related

Spark Scala: build a new column using a function using another dataframe

Here's my issue: I have a first dataframe which is basically a list of cities, and the country they reside in. I have a second dataframe, with a list of users, and the cities they reside in.
I'd like to add a "country" column to the second dataframe, where its value would be based on the "city" column of course, but the city names can be typed differently (for example Washington and washington would both have to give me USA).
I thought the best way to do that would be to create a function foo(country: String): String which would return the country by parsing the first dataframe, but I can't find a way to use this function while creating my new column.
First lowercase the city column of both dataframes, since you are going to join on the city key, and afterwards re-capitalize the first letter. This code should do what you are looking for:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, initcap, lower}

object Main {
  def main(args: Array[String]): Unit = {
    val sparkSession: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExamples.com")
      .getOrCreate()
    import sparkSession.implicits._

    val citiesDF = Seq(
      ("London", "England"), ("Washington", "USA")
    )
      .toDF("city", "country")
      .withColumn("city", lower(col("city")))

    val usersDF = Seq(
      ("Andy", "London"), ("Mark", "Washington"), ("Bob", "washington")
    )
      .toDF("name", "city")
      .withColumn("city", lower(col("city")))

    val resultDF = citiesDF.join(usersDF, Seq("city"))
      .withColumn("city", initcap(col("city")))

    resultDF.show()
  }
}
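A small variant worth noting, assuming you want to keep each user's original city spelling instead of the initcap form: build usersDF and citiesDF without the lower(...) withColumn step and normalise the case only inside the join condition.
// Hypothetical variant: lowercase only in the join condition,
// so the city column from usersDF keeps its original spelling.
val resultDF2 = usersDF
  .join(citiesDF, lower(usersDF("city")) === lower(citiesDF("city")))
  .select(usersDF("name"), usersDF("city"), citiesDF("country"))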

Use a set as the key in Dataset#groupByKey

Is there a way to use a Set as the key with Dataset#groupByKey? It looks like, for sets, Spark uses an encoder meant for arrays. This causes the order of values within a set to change the outcome.
Here's an example:
import org.apache.spark.sql._

object Main extends App {
  val spark =
    SparkSession
      .builder
      .appName("spark")
      .master("local")
      .getOrCreate()

  import spark.implicits._

  println {
    List("foo", "bar")
      .toDS()
      .groupByKey {
        case "foo" => Set(1, 2)
        case "bar" => Set(2, 1) // append .toList.sorted.toSet to get expected behavior
      }
      .keys
      .collect
      .mkString("\n")
  }

  spark.close()
}
I expect this to produce a single key, Set(1, 2). Instead, it produces two. The encoder looks like it's meant for ordered collections:
val e: Encoder[Set[Int]] = implicitly[Encoder[Set[Int]]]
println(s"${e}") // class[value[0]: array<int>]
Is this a bug? Should there be an encoder for sets? Is that even feasible?
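Until that question is settled, a minimal workaround sketch, assuming deterministic grouping is all that is needed: derive a canonical, ordered key from the set (for example a sorted Seq), so that equal sets always encode to the same array.
// Hypothetical workaround, inside the Main app above: group by a sorted Seq
// derived from the set, so Set(1, 2) and Set(2, 1) map to the same key.
List("foo", "bar")
  .toDS()
  .groupByKey {
    case "foo" => Set(1, 2).toSeq.sorted
    case "bar" => Set(2, 1).toSeq.sorted
  }
  .keys
  .collect() // one key instead of two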

Scala sortbykey and collect function

I am a beginner with Spark in Scala. I am writing a program that reads a CSV file and then counts the total spending per ID number. After counting the spending, when I sort the RDD using sortByKey() it is not sorted properly, but after applying collect() it prints in the proper order.
Before collect()
(0,5524.9497)
(51,4975.2197)
(1,4958.5996)
(52,5245.0605)
(2,5994.591)
(53,4945.3)
(3,4659.63)
(4,4815.05)
(5,4561.0703)
(6,5397.8794)
(7,4755.0693)
(8,5517.24)
(9,5322.6494)
(10,4819.6997)
After collect()
(0,5524.9497)
(1,4958.5996)
(2,5994.591)
(3,4659.63)
(4,4815.05)
(5,4561.0703)
(6,5397.8794)
(7,4755.0693)
(8,5517.24)
(9,5322.6494)
(10,4819.6997)
Code
def main(args: Array[String]): Unit = {
  Logger.getLogger("org").setLevel(Level.ERROR) // Only display errors in the program, if any
  val sc = new SparkContext("local[*]", "CustomerSpending")
  val lines = sc.textFile("../customer-orders.csv")
  val field = lines.map(x => (x.split(",")(0).toInt, x.split(",")(2).toFloat))
  val collectThemAll = field.reduceByKey((x, y) => x + y)
  val sorted = collectThemAll.sortByKey().collect()
  sorted.foreach(println)
}
Spark applies transformations lazily, i.e. they only run when you call an action like collect or take. So your sortByKey() is only actually executed when you call collect.
I created an App based on your sample data. I printed the RDD dependency using toDebugString so you can get insight into what is happening behind the scenes.
App
import org.apache.spark.sql.SparkSession

object PlaygroundApp extends App {
  val spark = SparkSession
    .builder()
    .appName("Stackoverflow App")
    .master("local[*]")
    .getOrCreate()

  val sc = spark.sparkContext

  val lines = sc.parallelize(Seq(
    (0, 5524.9497),
    (51, 4975.2197),
    (1, 4958.5996),
    (52, 5245.0605),
    (2, 5994.591),
    (53, 4945.3),
    (9, 5322.6494),
    (10, 4819.6997))
  )

  val collectThemAll = lines.reduceByKey((x, y) => x + y)
  println("---Before sort")
  collectThemAll.foreach(println)
  println(collectThemAll.toDebugString)

  println()
  println("---After sort")
  val sorted = collectThemAll.sortByKey()
  sorted.collect().foreach(println)
  println(sorted.toDebugString)
}
Output
---Before sort
(2,5994.591)
(53,4945.3)
(0,5524.9497)
(52,5245.0605)
(10,4819.6997)
(9,5322.6494)
(1,4958.5996)
(51,4975.2197)
(12) ShuffledRDD[1] at reduceByKey at PlaygroundApp.scala:28 []
+-(12) ParallelCollectionRDD[0] at parallelize at PlaygroundApp.scala:17 []
---After sort
(0,5524.9497)
(1,4958.5996)
(2,5994.591)
(9,5322.6494)
(10,4819.6997)
(51,4975.2197)
(52,5245.0605)
(53,4945.3)
(8) ShuffledRDD[4] at sortByKey at PlaygroundApp.scala:37 []
+-(12) ShuffledRDD[1] at reduceByKey at PlaygroundApp.scala:28 []
+-(12) ParallelCollectionRDD[0] at parallelize at PlaygroundApp.scala:17 []

How can I map a function on dataFrame column values which returns a dataFrame?

I have a Spark DataFrame, df1, which contains several columns, one of which holds patient IDs. I want to take this column and apply a function that sends an HTTP request for information about every ID, say a medical test. This information is then parsed from JSON and returned by the function as a DataFrame of multiple tests. I want to do this for all the IDs so that I end up with a second DataFrame, df2, with all the medical test information for the IDs in df1.
I tried the following code, which I think is not optimal, especially for a large number of patients. My problem is that I cannot handle the results in the form of Array[org.apache.spark.sql.DataFrame]. Note that this is sample code; in real life I might have 100 tests for one ID and only 3 for another.
import scala.util.Random._

val df1 = Seq(
  ("9031x", 32),
  ("1102z", 12),
  ("3048o", 54)
).toDF("ID", "age")

// a function that takes an ID string and returns a DataFrame
def getPatientInfo(ID: String): org.apache.spark.sql.DataFrame = {
  val r = scala.util.Random
  val df2 = Seq(
    ("test1", r.nextInt(100), r.nextInt(40) + 1980, r.nextString(4)),
    ("test2", r.nextInt(100), r.nextInt(40) + 1980, r.nextString(3)),
    ("test3", r.nextInt(100), r.nextInt(40) + 1980, r.nextString(5))
  ).toDF("testName", "year", "result", "Notes")
  df2
}

// convert the ID column to Array[String]
val ID = df1.collect().map(row => row.getString(0))
// apply the function for each ID
val medicalRecords = for (i <- ID) yield { getPatientInfo(i) }
Are there any other optimal approaches?
TL;DR;
It is not possible: the function passed to DataFrame.map (or an equivalent method) cannot use SparkSession or distributed data structures, so it cannot build a DataFrame per ID.
If you want to make it work, use your favourite JSON parser instead and redefine getPatientInfo as either:
def getPatientInfo(ID: String): Seq[Row]
or
def getPatientInfo(ID: String): T
where T is a case class, and then replace the for-comprehension with:
df1.flatMap(row => getPatientInfo(row.getString(0)))
(adding an Encoder if necessary).
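A minimal sketch of the case-class route, using the random generator from the question in place of the real HTTP call; the PatientTest class, its field layout and the SparkSession value named spark are illustrative assumptions:
import scala.util.Random

// Hypothetical case class describing one test record for one patient
case class PatientTest(ID: String, testName: String, year: Int, result: Int, notes: String)

// Plain Scala function: no SparkSession or DataFrame inside, so it is safe to call from flatMap
def getPatientInfo(id: String): Seq[PatientTest] = {
  val r = new Random
  (1 to 3).map(i => PatientTest(id, s"test$i", r.nextInt(40) + 1980, r.nextInt(100), r.nextString(4)))
}

import spark.implicits._ // provides the Encoder for the case class
val df2 = df1.flatMap(row => getPatientInfo(row.getString(0))) // Dataset[PatientTest]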

Dataframe state before save and after load - what's different?

I have a DF that contains some SQL expressions (coalesce, case/when etc.).
I later try to map/flatMap this DF, and I get a Task not serializable error due to the fields that contain the SQL expressions.
(Why I need to map/flatMap this DF is a separate question)
When I save this DF to a Parquet file and load it afterwards, the error is gone and I can convert to RDD and do transformations no problem!
How is the DF different before saving and after loading? In some way, the SQL expressions must have been evaluated and made persistent. How can I achieve the same thing without saving/loading? (df.persist() did not do the trick ;()
Here's test code:
val data = Seq(
  (1, "sku1", "EUR", 99.0, 89.0),
  (2, "sku2", "USD", 89.0, 79.0),
  (3, "sku3", "USD", 49.0, 39.9)
)
val aditionalStuffForCertainSKUsMap = Map("sku1" -> List(10, 20, 30))

val listedPrice = coalesce(
  List("EUR", "USD").map(c => when($"currency" === c, col(c)).otherwise(lit(null))): _*)

val df = (sc.parallelize(data)
  .toDF("id", "sku", "currency", "EUR", "USD")
  .withColumn("price_in_given_currency", when($"currency" === "EUR", $"EUR" * 2).otherwise(1))
  // .withColumn("fails_price_in_given_currency", listedPrice)
)

df.show
df.write.mode(SaveMode.Overwrite).parquet("test_df")
The data contains a sku, and some SKUs (like sku1) represent bundles, for which I want to add some other fields to the DF. But I only get complaints when I try to access this Map[String, List[Int]] within map() with the fails_price_in_given_currency column present, not with price_in_given_currency:
// If I load the df first, the map() works even when using `fails_price_in_given_currency`
// val df = sqlContext.read.parquet("test_df")
val out = df.map(d => {
  val key = d.getAs[String]("sku")
  aditionalStuffForCertainSKUsMap.getOrElse(key, None)
})
The error is thrown when I use fails_price_in_given_currency instead. If, however, I load df before the map, it runs fine again!