I'm working on AWS EMR with Spark version 2.4.7-amzn-1, using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_302).
I wanted to reduce a dataset of my custom case class Item by key, where the key itself is a custom case class. However, the reduceByKey did not work as I expected.
Here are the two classes:
case class Key(
  name: String,
  color: String
)

case class Item(
  name: String,
  color: String,
  count: Int
) {
  def key: Key = Key(name, color)
}
To aggregate, I defined a custom combine function in Item's companion object that just adds up the counts:
object Item {
  def combine(i1: Item, i2: Item): Item = i1.copy(count = i1.count + i2.count)
}
Here's my aggregate function:
import org.apache.spark.sql.Dataset
import spark.implicits._
def aggregate(items: Dataset[Item]): Dataset[Item] = items
  .rdd
  .keyBy(_.key)
  .reduceByKey(Item.combine)
  .map(_._2)
  .toDS
Now if I try to aggregate...
val items: Dataset[Item] = spark.sparkContext.parallelize(
  Seq(
    Item("Square", "green", 8),
    Item("Triangle", "blue", 3),
    Item("Square", "green", 5),
    Item("Triangle", "blue", 7)
  )
).toDS
val aggregated: Dataset[Item] = aggregate(items)
aggregated.show
...the output shows that the dataset has not been reduced:
+--------+-----+-----+
| name|color|count|
+--------+-----+-----+
| Square|green| 8|
| Square|green| 5|
|Triangle| blue| 3|
|Triangle| blue| 7|
+--------+-----+-----+
However, I observed that the aggregation did work when I changed the order of the four items in the sequence, so the outcome is not consistent.
If I change the key from being a case class instance
def key: Key = Key(name, color)
into being a tuple
def key: Tuple2[String, String] = (name, color)
the aggregation works as expected, giving this output:
+--------+-----+-----+
| name|color|count|
+--------+-----+-----+
| Square|green| 13|
|Triangle| blue| 10|
+--------+-----+-----+
So, does reduceByKey in general not (reliably) work with case classes? Is this the expected behavior? Or does this have nothing to do with case class vs. tuple, and does the real cause lie hidden somewhere else?
My Key class seems quite simple to me, so I guess it's not a hashing or comparison issue. (I could be wrong.)
I also looked at this question reduceByKey using Scala object as key, but there the cause turned out to be a typo, and chrisbtk explicitly stated: "Spark knows how to compare two object even if they do not implement Ordered."
Do I always have to use tuples as keys?
Try using the Dataset API directly:
Having:
import sparkSession.implicits._
import org.apache.spark.sql.{Encoder, Encoders}
implicit val key: Encoder[Key] = Encoders.product[Key]
You can do:
items
.groupByKey(_.key)
.reduceGroups(Item.combine)
.map(_._2)
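For completeness, here is a minimal sketch of the original aggregate function rewritten on top of the Dataset API, assuming the imports, classes and implicit Key encoder shown above are in scope, so no round trip through the RDD is needed:
def aggregate(items: Dataset[Item]): Dataset[Item] = items
  .groupByKey(_.key)             // group on the case-class key
  .reduceGroups(Item.combine)    // fold each group with the combine function
  .map(_._2)                     // drop the key, keep the reduced Item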
Related
How do I write the code below in a type-safe manner in Spark Scala with the Dataset API:
val schema: StructType = Encoders.product[CaseClass].schema

// read json from a file
val readAsDataSet: Dataset[CaseClass] = sparkSession.read.option("mode", mode).schema(schema).json(path).as[CaseClass]

// below code needs to be written in a type-safe way:
val someDF = readAsDataSet
  .withColumn("col1", explode(col("col_to_be_exploded")))
  .select(from_unixtime(col("timestamp").divide(1000)).as("date"), col("col1"))
As someone in the comments said, you can create a Dataset[CaseClass] and do your operations on that. Let's set it up:
import spark.implicits._
case class MyTest(timestamp: Long, col_explode: Seq[String])

val df = Seq(
  MyTest(1673850366000L, Seq("some", "strings", "here")),
  MyTest(1271850365998L, Seq("pasta", "with", "cream")),
  MyTest(611850366000L, Seq("tasty", "food"))
).toDF("timestamp", "col_explode").as[MyTest]
df.show(false)
+-------------+---------------------+
|timestamp |col_explode |
+-------------+---------------------+
|1673850366000|[some, strings, here]|
|1271850365998|[pasta, with, cream] |
|611850366000 |[tasty, food] |
+-------------+---------------------+
Typically, you can do many operations with the map function and the Scala language.
A map function returns the same number of elements as the input has. The explode function that you're using, however, does not return the same number of elements. You can implement this behaviour using the flatMap function.
So, using the Scala language and the flatMap function together, you can do something like this:
import java.time.LocalDateTime
import java.time.ZoneOffset
case class Exploded(datetime: String, exploded: String)

val output = df.flatMap { case MyTest(timestamp, col_explode) =>
  col_explode.map { value =>
    val date = LocalDateTime.ofEpochSecond(timestamp / 1000, 0, ZoneOffset.UTC).toString
    Exploded(date, value)
  }
}
output.show(false)
+-------------------+--------+
|datetime |exploded|
+-------------------+--------+
|2023-01-16T06:26:06|some |
|2023-01-16T06:26:06|strings |
|2023-01-16T06:26:06|here |
|2010-04-21T11:46:05|pasta |
|2010-04-21T11:46:05|with |
|2010-04-21T11:46:05|cream |
|1989-05-22T14:26:06|tasty |
|1989-05-22T14:26:06|food |
+-------------------+--------+
As you see, we've created a second case class called Exploded which we use to type our output dataset. Our output dataset has the following type: org.apache.spark.sql.Dataset[Exploded] so everything is completely type safe.
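As a small follow-up sketch (only the output Dataset defined above is assumed; the val name below is arbitrary), the typed result keeps later transformations compile-checked as well:
// Field access on Exploded is verified at compile time; a typo in a field name would not compile.
val lengths = output.map(e => (e.datetime, e.exploded.length))
lengths.show(false)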
My goal is to have a Spark DataFrame that holds each of my Candy objects in a separate row, with their respective properties:
+------------------------------------+
|main                                |
+------------------------------------+
|{"brand":"brand1","name":"snickers"}|
|{"brand":"brand2","name":"haribo"}  |
+------------------------------------+
Case class for Proof of concept
case class Candy(
  brand: String,
  name: String
)

val candy1 = Candy("brand1", "snickers")
val candy2 = Candy("brand2", "haribo")
So far I have only managed to put them in the same row with:
import org.json4s.DefaultFormats
import org.json4s.jackson.Serialization.{read, write}
implicit val formats = DefaultFormats
val body = write(Array(candy1, candy2))
val df=Seq(body).toDF("main")
df.show(5, false)
giving me everything in one row instead of 2. How can I split each object up into its own row while maintaining the schema of my Candy object?
+-------------------------------------------------------------------------+
| main |
+-------------------------------------------------------------------------+
|[{"brand":"brand1","name":"snickers"},{"brand":"brand2","name":"haribo"}]|
+-------------------------------------------------------------------------+
Do you want to keep the items as JSON strings inside the DataFrame?
If you don't, you can do this, taking advantage of a Dataset's ability to handle case classes:
val df=Seq(candy1, candy2).toDS
This will give you the following output:
+------+--------+
| brand| name|
+------+--------+
|brand1|snickers|
|brand2| haribo|
+------+--------+
IMHO that's the best option, but if you want to keep your data as a JSON string, you can first define a toJson method inside your case class:
case class Candy(brand: String, name: String) {
  def toJson = s"""{"brand": "$brand", "name": "$name" }"""
}
And then build the DF using that method:
val df=Seq(candy1.toJson, candy2.toJson).toDF("main")
OUTPUT
+----------------------------------------+
|main |
+----------------------------------------+
|{"brand": "brand1", "name": "snickers" }|
|{"brand": "brand2", "name": "haribo" } |
+----------------------------------------+
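If you'd rather not build the JSON string by hand, a hedged alternative sketch (using Spark's built-in to_json and struct functions, with candy1/candy2 and spark.implicits._ as above) lets Spark serialize each row:
import org.apache.spark.sql.functions.{col, struct, to_json}

// Serialize each Candy row into a JSON string column named "main".
val jsonDf = Seq(candy1, candy2).toDS
  .select(to_json(struct(col("brand"), col("name"))).as("main"))
jsonDf.show(false)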
I am trying to convert a simple DataFrame to a Dataset, following the example in Spark:
https://spark.apache.org/docs/latest/sql-programming-guide.html
case class Person(name: String, age: Int)
import spark.implicits._
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
But the following problem arises:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `age` from bigint to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "age")
- root class: ....
Can anyone help me out?
Edit
I noticed that it works with Long instead of Int!
Why is that?
Also:
val primitiveDS = Seq(1,2,3).toDS()
val augmentedDS = primitiveDS.map(i => ("var_" + i.toString, (i + 1).toLong))
augmentedDS.show()
augmentedDS.as[Person].show()
Prints:
+-----+---+
| _1| _2|
+-----+---+
|var_1| 2|
|var_2| 3|
|var_3| 4|
+-----+---+
Exception in thread "main"
org.apache.spark.sql.AnalysisException: cannot resolve '`name`' given input columns: [_1, _2];
Can anyone help me understand what is happening here?
If you change Int to Long (or BigInt) it works fine:
case class Person(name: String, age: Long)
import spark.implicits._
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
Output:
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
EDIT:
spark.read.json parses numbers as Long by default, as it's safer to do so.
You can change the column type afterwards using casting or UDFs.
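For example, here is a minimal sketch of the casting approach (PersonInt is a hypothetical variant of Person; Option[Int] is used because people.json contains a record with a null age):
import org.apache.spark.sql.functions.col

case class PersonInt(name: String, age: Option[Int]) // hypothetical variant keeping an Int-based age

val peopleIntDS = spark.read.json(path)
  .withColumn("age", col("age").cast("int")) // explicit downcast, so .as no longer complains about truncation
  .as[PersonInt]
peopleIntDS.show()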
EDIT2:
To answer your 2nd question, you need to name the columns correctly before the conversion to Person will work:
val primitiveDS = Seq(1,2,3).toDS()
val augmentedDS = primitiveDS
  .map(i => ("var_" + i.toString, (i + 1).toLong))
  .withColumnRenamed("_1", "name")
  .withColumnRenamed("_2", "age")
augmentedDS.as[Person].show()
Outputs:
+-----+---+
| name|age|
+-----+---+
|var_1| 2|
|var_2| 3|
|var_3| 4|
+-----+---+
This is how you create a Dataset from a case class:
case class Person(name: String, age: Long)
Keep the case class outside of the class that contains the code below:
val primitiveDS = Seq(1,2,3).toDS()
val augmentedDS = primitiveDS.map(i => Person("var_" + i.toString, (i + 1).toLong))
augmentedDS.show()
augmentedDS.as[Person].show()
Hope this helped
I am very pleased with Spark 2.0 Datasets because of their compile-time type safety. But here are a couple of problems that I am not able to work out, and I also didn't find good documentation for them.
Problem #1 - divide operation on an aggregated column
Consider the code below. I have a Dataset[MyClass] and I want to groupByKey on c1, c2, c3 and compute sum(c4) / 8. The code works well if I just calculate the sum, but it gives a compile-time error for divide(8). I wonder how I can achieve the following.
final case class MyClass(c1: String,
                         c2: String,
                         c3: String,
                         c4: Double)

import sparkSession.implicits._
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.expressions.scalalang.typed.{sum => typedSum}

val myCaseClass: Dataset[MyClass] = ??? // assume it's being loaded

myCaseClass
  .groupByKey(myCaseClass => (myCaseClass.c1, myCaseClass.c2, myCaseClass.c3))
  .agg(typedSum[MyClass](_.c4).name("sum(c4)")
    .divide(8)) // this is breaking with exception
  .show()
If I remove the .divide(8) operation and run the above code, it gives me the output below.
+-----------+-------------+
| key|sum(c4) |
+-----------+-------------+
| [A1,F2,S1]| 80.0|
| [A1,F1,S1]| 40.0|
+-----------+-------------+
Problem #2 - converting the groupByKey result to another typed Dataset
The second part of my problem is that I want the output to be a typed Dataset again. For that I have another case class (not sure if it is needed), but I am not sure how to map the grouped result to it:
final case class AnotherClass(c1: String,
                              c2: String,
                              c3: String,
                              average: Double)

myCaseClass
  .groupByKey(myCaseClass => (myCaseClass.c1, myCaseClass.c2, myCaseClass.c3))
  .agg(typedSum[MyClass](_.c4).name("sum(c4)"))
  .as[AnotherClass] // this is breaking with exception
but this again fails with an exception, as the groupByKey result does not map directly to AnotherClass.
PS: any other solution to achieve the above is more than welcome.
The first problem can be resolved by using typed columns all the way down (KeyValueGroupedDataset.agg expects TypedColumn(s)).
You can define the aggregation result as:
import org.apache.spark.sql.functions.lit

val eight = lit(8.0)
  .as[Double] // Not necessary

val sumByEight = typedSum[MyClass](_.c4)
  .divide(eight)
  .as[Double] // Required
  .name("div(sum(c4), 8)")
and plug it into the following code:
val myCaseClass = Seq(
  MyClass("a", "b", "c", 2.0),
  MyClass("a", "b", "c", 3.0)
).toDS

myCaseClass
  .groupByKey(myCaseClass => (myCaseClass.c1, myCaseClass.c2, myCaseClass.c3))
  .agg(sumByEight)
to get
+-------+---------------+
| key|div(sum(c4), 8)|
+-------+---------------+
|[a,b,c]| 0.625|
+-------+---------------+
The second problem is a result of using a class which doesn't conform to the data shape. A correct representation could be:
case class AnotherClass(key: (String, String, String), sum: Double)
which used with data defined above:
myCaseClass
.groupByKey(myCaseClass => (myCaseClass.c1, myCaseClass.c2, myCaseClass.c3))
.agg(typedSum[MyClass](_.c4).name("sum"))
.as[AnotherClass]
would give:
+-------+---+
| key|sum|
+-------+---+
|[a,b,c]|5.0|
+-------+---+
but .as[AnotherClass] is not necessary here if Dataset[((String, String, String), Double)] is acceptable.
You can of course skip all of that and just use mapGroups (although not without a performance penalty):
import shapeless.syntax.std.tuple._ // A little bit of shapeless
val tuples = myCaseClass
.groupByKey(myCaseClass => (myCaseClass.c1, myCaseClass.c2, myCaseClass.c3))
.mapGroups((group, iter) => group :+ iter.map(_.c4).sum)
with result
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| a| b| c|5.0|
+---+---+---+---+
reduceGroups could be a better option:
myCaseClass
.groupByKey(myCaseClass => (myCaseClass.c1, myCaseClass.c2, myCaseClass.c3))
.reduceGroups((x, y) => x.copy(c4=x.c4 + y.c4))
with resulting Dataset:
+-------+-----------+
| _1| _2|
+-------+-----------+
|[a,b,c]|[a,b,c,5.0]|
+-------+-----------+
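If only the reduced records are needed, a final map can drop the key again; a minimal sketch with the data defined above, yielding a plain Dataset[MyClass]:
myCaseClass
  .groupByKey(myCaseClass => (myCaseClass.c1, myCaseClass.c2, myCaseClass.c3))
  .reduceGroups((x, y) => x.copy(c4 = x.c4 + y.c4))
  .map(_._2) // keep only the reduced MyClass rows
  .show()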
I'm struggling to understand how to craft Dataset schemas. I have a dataset from an aggregation with the key tuple in one column, and the aggregate in the second:
> ds.show
+------+------+
| _1| _2|
+------+------+
|[96,0]| 93439|
|[69,0]|174386|
|[42,0]| 12427|
|[15,0]| 2090|
|[80,0]| 2626|
|[91,0]| 71963|
|[64,0]| 191|
|[37,0]| 13|
|[48,0]| 13898|
|[21,0]| 2510|
|[59,0]| 1874|
|[32,0]| 373|
| [5,0]| 1075|
|[97,0]| 2|
|[16,0]| 492|
|[11,0]| 34040|
|[76,0]| 4|
|[22,0]| 1216|
|[60,0]| 522|
|[33,0]| 287|
+------+------+
only showing top 20 rows
> ds.schema
StructType(StructField(_1,StructType(StructField(src,IntegerType,false), StructField(dst,IntegerType,false)),true), StructField(_2,LongType,false))
Why can't I apply this schema?
> val mySchema = StructType(StructField("_1",StructType(StructField("src",IntegerType,false),
StructField("dst",IntegerType,false)),true),
StructField("_2",LongType,false))
> ds.as[mySchema]
Name: Compile Error
Message: <console>:41: error: overloaded method value apply with alternatives:
(fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
cannot be applied to (org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField)
val returnSchema = StructType(StructField("_1",StructType(StructField("src",IntegerType,false),
^
I've also unsuccessfully tried to reflect a Scala case class:
> final case class AggSchema(edge: (Int, Int), count: Long)
> ds.as[AggSchema]
Name: org.apache.spark.sql.catalyst.analysis.UnresolvedException
Message: Invalid call to dataType on unresolved object, tree: 'edge
StackTrace: org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:59)
...
The first approach doesn't work because a schema is not a type. A schema is just an object which describes the Catalyst types of the columns. In other words, a schema is just the metadata required to interpret the values stored in a DataFrame. Without it, a DataFrame is nothing more than a Dataset[Row], and o.a.s.sql.Row is pretty much a Seq[Any].
The second approach doesn't work because the names of the case class fields don't match the schema. Since the columns are named _1 and _2, given data like:
case class Edge(src: Int, dst: Int)
val df = Seq((Edge(96, 0), 93439L)).toDF
Either don't use names at all:
df.as[((Int, Int), Long)]
or use column names that match the case class, for example:
case class AggSchema(edge: Edge, count: Long)
df.toDF("edge", "count").as[AggSchema]
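Once the Dataset is typed, downstream access is compile-checked; a small usage sketch with the df defined above (the val name is arbitrary):
val typed = df.toDF("edge", "count").as[AggSchema]
typed.map(a => (a.edge.src, a.count)).show() // fields resolved at compile time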