Use a set as the key in Dataset#groupByKey - scala

Is there a way to use a Set as the key with Dataset#groupByKey? It looks like, for sets, Spark uses an encoder meant for arrays. This causes the order of values within a set to change the outcome.
Here's an example:
import org.apache.spark.sql._

object Main extends App {
  val spark =
    SparkSession
      .builder
      .appName("spark")
      .master("local")
      .getOrCreate()

  import spark.implicits._

  println {
    List("foo", "bar")
      .toDS()
      .groupByKey {
        case "foo" => Set(1, 2)
        case "bar" => Set(2, 1) // append .toList.sorted.toSet to get expected behavior
      }
      .keys
      .collect
      .mkString("\n")
  }

  spark.close()
}
I expect this to produce a single key, Set(1, 2). Instead, it produces two. The encoder looks like it's meant for ordered collections:
val e: Encoder[Set[Int]] = implicitly[Encoder[Set[Int]]]
println(s"${e}") // class[value[0]: array<int>]
Is this a bug? Should there be an encoder for sets? Is that even feasible?
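For what it's worth, here is a minimal sketch of the workaround hinted at by the comment in the code above: normalize each set into a canonically ordered Seq before using it as the key, so the array-based encoder produces identical bytes for equal sets. This is a workaround sketch, not a set-aware encoder.
List("foo", "bar")
  .toDS()
  .groupByKey {
    case "foo" => Set(1, 2).toSeq.sorted // normalize to a sorted Seq
    case "bar" => Set(2, 1).toSeq.sorted // encodes to the same key as above
  }
  .keys
  .collect()
  .foreach(println) // a single key remains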

Related

Scala sortByKey and collect function

I am a beginner in Spark with Scala. I am writing a program that reads a CSV file and then counts the total spending per ID number. After counting the spending, when I sort the RDD using sortByKey() it does not come out sorted, but after applying collect() it prints in the proper order.
Before collect()
(0,5524.9497)
(51,4975.2197)
(1,4958.5996)
(52,5245.0605)
(2,5994.591)
(53,4945.3)
(3,4659.63)
(4,4815.05)
(5,4561.0703)
(6,5397.8794)
(7,4755.0693)
(8,5517.24)
(9,5322.6494)
(10,4819.6997)
After collect()
(0,5524.9497)
(1,4958.5996)
(2,5994.591)
(3,4659.63)
(4,4815.05)
(5,4561.0703)
(6,5397.8794)
(7,4755.0693)
(8,5517.24)
(9,5322.6494)
(10,4819.6997)
Code
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext

def main(args: Array[String]): Unit = {
  // Quiet Spark's own logging so only errors are shown
  Logger.getLogger("org").setLevel(Level.ERROR)
  val sc = new SparkContext("local[*]", "CustomerSpending")
  val lines = sc.textFile("../customer-orders.csv")
  val field = lines.map(x => (x.split(",")(0).toInt, x.split(",")(2).toFloat))
  val collectThemAll = field.reduceByKey((x, y) => x + y)
  val sorted = collectThemAll.sortByKey().collect()
  sorted.foreach(println)
}
Spark applies transformations lazily, i.e. only when you call an action such as collect or take. So your sortByKey() is only actually executed once you call collect.
I created an app based on your sample data and printed the RDD lineage using toDebugString so you can see what is happening behind the scenes.
App
import org.apache.spark.sql.SparkSession

object PlaygroundApp extends App {
  val spark = SparkSession
    .builder()
    .appName("Stackoverflow App")
    .master("local[*]")
    .getOrCreate()

  val sc = spark.sparkContext

  val lines = sc.parallelize(Seq(
    (0, 5524.9497),
    (51, 4975.2197),
    (1, 4958.5996),
    (52, 5245.0605),
    (2, 5994.591),
    (53, 4945.3),
    (9, 5322.6494),
    (10, 4819.6997))
  )

  val collectThemAll = lines.reduceByKey((x, y) => x + y)
  println("---Before sort")
  collectThemAll.foreach(println)
  println(collectThemAll.toDebugString)
  println()

  println("---After sort")
  val sorted = collectThemAll.sortByKey()
  sorted.collect().foreach(println)
  println(sorted.toDebugString)
}
Output
---Before sort
(2,5994.591)
(53,4945.3)
(0,5524.9497)
(52,5245.0605)
(10,4819.6997)
(9,5322.6494)
(1,4958.5996)
(51,4975.2197)
(12) ShuffledRDD[1] at reduceByKey at PlaygroundApp.scala:28 []
+-(12) ParallelCollectionRDD[0] at parallelize at PlaygroundApp.scala:17 []
---After sort
(0,5524.9497)
(1,4958.5996)
(2,5994.591)
(9,5322.6494)
(10,4819.6997)
(51,4975.2197)
(52,5245.0605)
(53,4945.3)
(8) ShuffledRDD[4] at sortByKey at PlaygroundApp.scala:37 []
+-(12) ShuffledRDD[1] at reduceByKey at PlaygroundApp.scala:28 []
+-(12) ParallelCollectionRDD[0] at parallelize at PlaygroundApp.scala:17 []

scala spark conversion error when creating dataframe

I am a newbie in Scala. Please be patient.
I have this code.
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation._
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// create spark session
implicit val spark = SparkSession.builder().appName("clustering").getOrCreate()
import spark.implicits._ // needed for the tuple Encoder used by .map below

// read file
val fileName = """file:///some_location/head_sessions_sample.csv"""

// create DF from file
val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(fileName)

def inputKmeans(df: DataFrame, spark: SparkSession): DataFrame = {
  try {
    val a = df.select("id", "start_ts", "duration", "ip_dist")
      .map(r => (r.getInt(0), Vectors.dense(r.getDouble(1), r.getDouble(2), r.getDouble(3))))
      .toDF("id", "features")
    a
  } catch {
    case e: java.lang.ClassCastException => spark.emptyDataFrame
  }
}

val t = inputKmeans(df, spark).filter(_ != null)
t.foreach(r =>
  if (r.get(0) != null)
    println(r.get(0)))
For the moment, I want to ignore my conversion errors. But somehow, I still get them.
2018-09-24 11:26:22 ERROR Executor:91 - Exception in task 0.0 in stage 4.0 (TID 6)
java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Double
I don't think there is any point in giving a snapshot of the CSV. At this point, I just want to ignore the conversion errors.
Any ideas why this is happening?
As mentioned in the comment, the issue is that the values are not of Double type.
val a = df.select("id", "start_ts", "duration", "ip_dist").map(r => (r.getInt(0), Vectors.dense(r.getDouble(1), r.getDouble(2), r.getDouble(3)))).toDF("id", "features")
Either read/cast the values as the correct data type, i.e. Long (you can also provide the schema explicitly using a case class and apply it to the DataFrame).
Or use the VectorAssembler to convert the columns into a features vector. This is the easier and recommended approach.
import org.apache.spark.ml.feature.VectorAssembler

def inputKmeans(df: DataFrame, spark: SparkSession): DataFrame = {
  val assembler = new VectorAssembler()
    .setInputCols(Array("start_ts", "duration", "ip_dist"))
    .setOutputCol("features")
  val output = assembler.transform(df).select("id", "features")
  output
}
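A hedged usage sketch for the function above, feeding the assembled features into the KMeans estimator already imported in the question (k = 3 and the seed are arbitrary placeholders, not values from the original post):
val features = inputKmeans(df, spark) // DataFrame with "id" and "features" columns

val kmeans = new KMeans().setK(3).setSeed(1L) // placeholder hyperparameters
val model = kmeans.fit(features)
model.transform(features).select("id", "prediction").show()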
I think I discovered the problem: the try/catch is placed at the level of the DataFrame creation, not at the level of the conversion. As a consequence, it catches problems related to creating the DataFrame, not the conversion issues (the map is lazy, so the ClassCastException is only thrown later, on the executors, outside the try/catch).

Why does query with UDF fail with "Task not serializable" exception?

I have created a UDF and I am trying to apply it to the result of a coalesce inside a join.
Ideally I would like to do this during a join:
def foo(value: Double): Double = {
  value / 100
}

val foo = udf(foo _)

df.join(.....)
  .withColumn("value", foo(coalesce(new Column("valueA"), new Column("valueB"))))
But I am getting the exception Task not serializable.
Is there a way to work around that?
Use a lambda function to make it serializable. This example works fine.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.coalesce
import org.apache.spark.sql.functions.udf

val central: DataFrame = Seq(
  (1, Some(2014)),
  (2, null)
).toDF("key", "year1")

val other1: DataFrame = Seq(
  (1, 2016),
  (2, 2015)
).toDF("key", "year2")

def fooUDF = udf { v: Double => v / 100 }

val result = central.join(other1, Seq("key"))
  .withColumn("value", fooUDF(coalesce(col("year1"), col("year2"))))
But I am getting the exception Task not serializable.
The reason for the infamous "Task not serializable" exception is that def foo(value: Double): Double is part of an unserializable owning object (perhaps one holding a SparkSession, which indirectly references an unserializable SparkContext).
A solution is to define the method as part of a "standalone" object that has no references to unserializable values.
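For illustration, a minimal sketch of that approach, reusing the divide-by-100 logic from the question; the object name, the other DataFrame and the join key are hypothetical placeholders:
import org.apache.spark.sql.functions.{coalesce, col, udf}

// A standalone top-level object: no reference back to an unserializable enclosing scope.
object MathOps extends Serializable {
  def foo(value: Double): Double = value / 100
}

val fooUDF = udf(MathOps.foo _)

df.join(other, Seq("key")) // placeholder join, as elided in the question
  .withColumn("value", fooUDF(coalesce(col("valueA"), col("valueB"))))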
Is there a way to work around that?
See the other answer by #firas.

How do I dynamically create a UDF in Spark?

I have a DataFrame where I want to create multiple UDFs dynamically to determine if certain rows match. I am just testing one example right now. My test code looks like the following.
// create the dataframe
import spark.implicits._
val df = Seq(("t", "t"), ("t", "f"), ("f", "t"), ("f", "f")).toDF("n1", "n2")

// create the scala function
def filter(v1: Seq[Any], v2: Seq[String]): Int = {
  for (i <- 0 until v1.length) {
    if (!v1(i).equals(v2(i))) {
      return 0
    }
  }
  return 1
}

// create the udf
import org.apache.spark.sql.functions.udf
val fudf = udf(filter(_: Seq[Any], _: Seq[String]))

// apply the UDF
df.withColumn("filter1", fudf(Seq($"n1"), Seq("t"))).show()
However, when I run the last line, I get the following error.
:30: error: not found: value df
df.withColumn("filter1", fudf($"n1", Seq("t"))).show()
^
:30: error: type mismatch;
found : Seq[String]
required: org.apache.spark.sql.Column
df.withColumn("filter1", fudf($"n1", Seq("t"))).show()
^
Any ideas on what I'm doing wrong? Note, I am on Scala v2.11.x and Spark 2.0.x.
On another note, if we can solve this "dynamic" UDF question/concern, my use case would be to add them to the dataframe. With some test code as follows, it takes forever (it doesn't even finish, I had to ctrl-c to break out). I'm guessing doing a bunch of .withColumn in a for-loop is a bad idea in Spark. If so, please let me know and I'll abandon this approach altogether.
import spark.implicits._
val df = Seq(("t", "t"), ("t", "f"), ("f", "t"), ("f", "f")).toDF("n1", "n2")

import org.apache.spark.sql.functions.udf
val fudf = udf((x: String) => if (x.equals("t")) 1 else 0)

var df2 = df
for (i <- 0 until 10000) {
  df2 = df2.withColumn("filter" + i, fudf($"n1"))
}
Enclose "t" in lit()
df.withColumn("filter1", fudf($"n1", Seq(lit("t")))).show()
Try registering the UDF on the sqlContext.
Spark 2.0 UDF registration
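Building on the lit() suggestion, here is a hedged sketch (my own rewrite, not the asker's exact signature) that also wraps both arguments in array(...), since the UDF parameters are Seqs while udf arguments must be Columns:
import org.apache.spark.sql.functions.{array, lit, udf}

// Element-wise comparison of two string sequences: 1 if every position matches, else 0.
val matchUdf = udf { (v1: Seq[String], v2: Seq[String]) =>
  if (v1.zip(v2).forall { case (a, b) => a == b }) 1 else 0
}

// array(...) builds a single ArrayType Column, which Spark passes to the UDF as a Seq.
df.withColumn("filter1", matchUdf(array($"n1"), array(lit("t")))).show()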

Spark: Sort records in groups?

I have a set of records which I need to:
1) Group by 'date', 'city' and 'kind'
2) Sort every group by 'prize'
In my code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object Sort {

  case class Record(name: String, day: String, kind: String, city: String, prize: Int)

  val recs = Array(
    Record("n1", "d1", "k1", "c1", 10),
    Record("n1", "d1", "k1", "c1", 9),
    Record("n1", "d1", "k1", "c1", 8),
    Record("n2", "d2", "k2", "c2", 1),
    Record("n2", "d2", "k2", "c2", 2),
    Record("n2", "d2", "k2", "c2", 3)
  )

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Test")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)
    val rs = sc.parallelize(recs)
    val rsGrp = rs.groupBy(r => (r.day, r.kind, r.city)).map(_._2)
    val x = rsGrp.map { r =>
      val lst = r.toList
      lst.map { e => (e.prize, e) }
    }
    x.sortByKey()
  }
}
When I try to sort a group I get an error:
value sortByKey is not a member of org.apache.spark.rdd.RDD[List[(Int,
Sort.Record)]]
What is wrong? How to sort?
You need to define a key and then use mapValues to sort the values.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._

object Sort {

  case class Record(name: String, day: String, kind: String, city: String, prize: Int)

  // Define your data

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Test")
      .setMaster("local")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)
    val rs = sc.parallelize(recs)

    // Generate the pair RDD necessary to call groupByKey, and group it
    val key: RDD[((String, String, String), Iterable[Record])] =
      rs.keyBy(r => (r.day, r.city, r.kind)).groupByKey

    // Once grouped, sort the values of each key
    val values: RDD[((String, String, String), List[Record])] =
      key.mapValues(iter => iter.toList.sortBy(_.prize))

    // Print the result
    values.collect.foreach(println)
  }
}
groupByKey is expensive; it has two implications:
On average, the majority of the data gets shuffled across the remaining N-1 partitions.
All of the records for the same key are loaded into memory on a single executor, potentially causing memory errors.
Depending on your use case, you have better options:
If you don't care about the ordering, use reduceByKey or aggregateByKey.
If you just want to group and sort without any transformation, prefer repartitionAndSortWithinPartitions (Spark 1.3.0+, http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.OrderedRDDFunctions), but be very careful about which partitioner you specify and test it, because you are now relying on side effects that may change behaviour in a different environment (see the sketch below). See also the examples in this repository: https://github.com/sryza/aas/blob/master/ch08-geotime/src/main/scala/com/cloudera/datascience/geotime/RunGeoTime.scala.
If you are applying a transformation or a non-reducible aggregation (fold or scan) to the iterable of sorted records, then check out this library: spark-sorted (https://github.com/tresata/spark-sorted). It provides 3 APIs for paired RDDs: mapStreamByKey, foldLeftByKey and scanLeftByKey.
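A minimal sketch of the repartitionAndSortWithinPartitions route, reusing rs and Record from the question. The secondary-sort trick (prize folded into the key, partitioning only on the group part) and the partition count are my own assumptions to illustrate the point, not a drop-in answer:
import org.apache.spark.{HashPartitioner, Partitioner}

// Fold the prize into the key so the built-in key ordering sorts records within
// each group, but partition only on the group part so every record of a group
// lands in the same partition.
class GroupPartitioner(partitions: Int) extends Partitioner {
  private val delegate = new HashPartitioner(partitions)
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case (day: String, kind: String, city: String, _) => delegate.getPartition((day, kind, city))
  }
}

val keyed = rs.map(r => ((r.day, r.kind, r.city, r.prize), r))
val sortedWithinGroups = keyed.repartitionAndSortWithinPartitions(new GroupPartitioner(4))
Iterating over each partition then yields the records of every group contiguously and already ordered by prize.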
Replace map with flatMap
val x = rsGrp.flatMap { r =>
  val lst = r.toList
  lst.map { e => (e.prize, e) }
}
This will give you an
org.apache.spark.rdd.RDD[(Int, Record)] = FlatMappedRDD[10]
and then you can call sortBy(_._1) on the RDD above.
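For completeness, a short sketch of that follow-up call:
val sortedByPrize = x.sortBy(_._1) // sort the (prize, record) pairs by prize
sortedByPrize.collect().foreach(println)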
As an alternative to #gasparms' solution, I think one can try a filter followed by an rdd.sortBy operation. You filter the records that meet the key criteria. The prerequisite is that you need to keep track of all your keys (filter combinations). You can also build that key set as you traverse the records.