Aggregate a dataset by key and collect the remaining values into a list - Scala

I am trying to aggregate by key and get the remaining values as a list. I am not sure I am explaining this well; I am new to these Spark features.
case class myScheme(classId: Int,
                    StudentId: Long,
                    joiningDate: java.sql.Timestamp,
                    SchoolName: String)
myDS.show()
classId | studentId| joiningDate |SchoolName
1|100|2010-01-01|"School1"
2|110|2011-01-01|"School1"
2|200|2010-01-01|"School1"
3|300|2020-02-02|"School2"
I want to group all student ids by SchoolName and classId.
The final result from the above dataset should be:
"school1" ,1, [100 ]
"school2" ,2, [110,200]
"school3" ,3, [300]

Your current input and expected result do not make much sense, but it seems that you need a simple groupBy with collect_list:
import org.apache.spark.sql.functions.collect_list

myDS.groupBy("SchoolName", "classId").agg(collect_list("studentId")).show
+----------+-------+-----------------------+
|SchoolName|classId|collect_list(studentId)|
+----------+-------+-----------------------+
| School1| 1| [100]|
| School2| 3| [300]|
| School1| 2| [110, 200]|
+----------+-------+-----------------------+
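For reference, a minimal sketch of how the sample data from the question could be set up to try these snippets (this assumes an active SparkSession named spark; the values are taken from the question):

import java.sql.Timestamp
import spark.implicits._

val myDS = Seq(
  myScheme(1, 100L, Timestamp.valueOf("2010-01-01 00:00:00"), "School1"),
  myScheme(2, 110L, Timestamp.valueOf("2011-01-01 00:00:00"), "School1"),
  myScheme(2, 200L, Timestamp.valueOf("2010-01-01 00:00:00"), "School1"),
  myScheme(3, 300L, Timestamp.valueOf("2020-02-02 00:00:00"), "School2")
).toDS()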

If the data is already available in the form of a (typed) Dataset, you can use groupByKey to stay within the typed API:
import org.apache.spark.sql.Dataset

// input data
val ds: Dataset[myScheme] = ...

// case class for the result
case class Group(SchoolName: String,
                 classId: Int,
                 data: List[myScheme])

// the grouping operation
val dsGrouped: Dataset[Group] =
  ds.groupByKey(r => (r.SchoolName, r.classId))
    .mapGroups((key, values) => Group(key._1, key._2, values.toList))
dsGrouped.show(false)
Output:
+----------+-------+--------------------------------------------------------------------------------+
|SchoolName|classId|data                                                                            |
+----------+-------+--------------------------------------------------------------------------------+
|School1   |1      |[{1, 100, 2010-01-01 00:00:00, School1}]                                        |
|School2   |3      |[{3, 300, 2020-02-02 00:00:00, School2}]                                        |
|School1   |2      |[{2, 110, 2011-01-01 00:00:00, School1}, {2, 200, 2010-01-01 00:00:00, School1}]|
+----------+-------+--------------------------------------------------------------------------------+
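If you only want the student ids in the list, as in the expected result, you can map them out inside mapGroups. A small sketch of that variant (the result case class name IdGroup is just an illustration, not from the question):

case class IdGroup(SchoolName: String, classId: Int, studentIds: List[Long])

val idsGrouped: Dataset[IdGroup] =
  ds.groupByKey(r => (r.SchoolName, r.classId))
    .mapGroups((key, values) => IdGroup(key._1, key._2, values.map(_.StudentId).toList))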

Related

How to assign a category to each row based on the cumulative sum of values in spark dataframe?

I have a Spark dataframe consisting of two columns [Employee and Salary], where salary is in ascending order.
Sample Dataframe:
| Employee | salary |
| -------- | ------ |
| Emp1     | 10     |
| Emp2     | 20     |
| Emp3     | 30     |
| Emp4     | 35     |
| Emp5     | 36     |
| Emp6     | 50     |
| Emp7     | 70     |
I want to group the rows such that each group has an aggregated salary of less than 80, and assign a category to each group, something like this: I will keep adding row salaries until the sum becomes more than 80; as soon as it does, I assign a new category.
Expected Output:
| Employee | salary | Category |
| -------- | ------ | -------- |
| Emp1     | 10     | A        |
| Emp2     | 20     | A        |
| Emp3     | 30     | A        |
| Emp4     | 35     | B        |
| Emp5     | 36     | B        |
| Emp6     | 50     | C        |
| Emp7     | 70     | D        |
Is there a simple way we can do this in Spark Scala?
To solve your problem, you can use a custom aggregate function over a window.
First, you need to create your custom aggregate function. An aggregate function is defined by an accumulator (a buffer) that is initialized (zero function), updated when processing a new row (reduce function) or when encountering another accumulator (merge function), and finally turned into a result (finish function).
In your case, the accumulator should keep two pieces of information:
Current category of employees
Sum of salaries of the previous employees belonging to the current category
To store this information, you can use a tuple (Int, Int), where the first element is the current category and the second element is the sum of salaries of the previous employees in the current category:
You initialize this tuple with (0, 0).
When you encounter a new row, if the sum of the previous salaries plus the current row's salary is over 80, you increment the category and reset the running sum to the current row's salary; otherwise you add the current row's salary to the running sum.
As you will be using a window function, rows are processed sequentially, so you don't need to implement merging with another accumulator.
And at the end, as you only want the category, you return only the first element of the accumulator.
So we get the following aggregator implementation:
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

object Labeler extends Aggregator[Int, (Int, Int), Int] {
  override def zero: (Int, Int) = (0, 0)

  override def reduce(catAndSum: (Int, Int), salary: Int): (Int, Int) = {
    if (catAndSum._2 + salary > 80)
      (catAndSum._1 + 1, salary)
    else
      (catAndSum._1, catAndSum._2 + salary)
  }

  override def merge(catAndSum1: (Int, Int), catAndSum2: (Int, Int)): (Int, Int) =
    throw new NotImplementedError("should be used only over a window function")

  override def finish(catAndSum: (Int, Int)): Int = catAndSum._1

  override def bufferEncoder: Encoder[(Int, Int)] = Encoders.tuple(Encoders.scalaInt, Encoders.scalaInt)

  override def outputEncoder: Encoder[Int] = Encoders.scalaInt
}
Once you have your aggregator, you turn it into a Spark aggregate function using the udaf function.
You then create a window over the whole dataframe, ordered by salary, and apply your Spark aggregate function over this window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, udaf}
val labeler = udaf(Labeler)
val window = Window.orderBy("salary")
val result = dataframe.withColumn("category", labeler(col("salary")).over(window))
Using your example as input dataframe, you get the following result dataframe:
+--------+------+--------+
|employee|salary|category|
+--------+------+--------+
|Emp1 |10 |0 |
|Emp2 |20 |0 |
|Emp3 |30 |0 |
|Emp4 |35 |1 |
|Emp5 |36 |1 |
|Emp6 |50 |2 |
|Emp7 |70 |3 |
+--------+------+--------+
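For completeness, a sketch of how the input dataframe could be built to reproduce this, and one way to turn the numeric categories into letters as in the expected output (this assumes an active SparkSession named spark; the toLetter helper is just an illustration, not part of the answer above):

import spark.implicits._
import org.apache.spark.sql.functions.{col, udf}

val dataframe = Seq(
  ("Emp1", 10), ("Emp2", 20), ("Emp3", 30), ("Emp4", 35),
  ("Emp5", 36), ("Emp6", 50), ("Emp7", 70)
).toDF("employee", "salary")

// map categories 0, 1, 2, ... to A, B, C, ... if letter categories are preferred
val toLetter = udf((i: Int) => ('A' + i).toChar.toString)
result.withColumn("category", toLetter(col("category"))).show(false)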

Spark Dataframe filldown

I would like to do a "filldown" type operation on a dataframe in order to remove nulls and make sure the last row is a kind of summary row, containing the last known values for each column based on the timestamp, grouped by the itemId. As I'm using Azure Synapse Notebooks, the language can be Scala, PySpark, SparkSQL or even C#. However, the real solution has up to millions of rows and hundreds of columns, so I need a dynamic solution that can take advantage of Spark. We can provision a big cluster; how do we make sure we take good advantage of it?
Sample data:
// Assign sample data to dataframe
val df = Seq(
  (1, "10/01/2021", 1, "abc", null),
  (2, "11/01/2021", 1, null, "bbb"),
  (3, "12/01/2021", 1, "ccc", null),
  (4, "13/01/2021", 1, null, "ddd"),
  (5, "10/01/2021", 2, "eee", "fff"),
  (6, "11/01/2021", 2, null, null),
  (7, "12/01/2021", 2, null, null)
).toDF("eventId", "timestamp", "itemId", "attrib1", "attrib2")
df.show
Expected results with rows 4 and 7 as summary rows:
+-------+----------+------+-------+-------+
|eventId| timestamp|itemId|attrib1|attrib2|
+-------+----------+------+-------+-------+
| 1|10/01/2021| 1| abc| null|
| 2|11/01/2021| 1| abc| bbb|
| 3|12/01/2021| 1| ccc| bbb|
| 4|13/01/2021| 1| ccc| ddd|
| 5|10/01/2021| 2| eee| fff|
| 6|11/01/2021| 2| eee| fff|
| 7|12/01/2021| 2| eee| fff|
+-------+----------+------+-------+-------+
I have reviewed this option but had trouble adapting it for my use case.
Spark / Scala: forward fill with last observation
I have a kind-of-working SparkSQL solution, but it will be very verbose for the high number of columns; I am hoping for something easier to maintain:
%%sql
WITH cte AS (
SELECT
eventId,
itemId,
ROW_NUMBER() OVER( PARTITION BY itemId ORDER BY timestamp ) AS rn,
attrib1,
attrib2
FROM df
)
SELECT
eventId,
itemId,
CASE rn WHEN 1 THEN attrib1
ELSE COALESCE( attrib1, LAST_VALUE(attrib1, true) OVER( PARTITION BY itemId ) )
END AS attrib1_xlast,
CASE rn WHEN 1 THEN attrib2
ELSE COALESCE( attrib2, LAST_VALUE(attrib2, true) OVER( PARTITION BY itemId ) )
END AS attrib2_xlast
FROM cte
ORDER BY eventId
For many columns you could build the expressions programmatically, as below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, last}
import spark.implicits._

val window = Window.partitionBy($"itemId").orderBy($"timestamp")

// Instead of selecting columns explicitly, build the list of filled-down expressions from df.columns
val expr = df.columns
  .map(c => coalesce(col(c), last(col(c), true).over(window)).as(c))

df.select(expr: _*).show(false)
Update: to keep the key columns untouched and apply the fill-down only to the attrib columns:
val mainColumns = df.columns.filterNot(_.startsWith("attrib"))
val aggColumns = df.columns.diff(mainColumns)
  .map(c => coalesce(col(c), last(col(c), true).over(window)).as(c))

df.select((mainColumns.map(col) ++ aggColumns): _*).show(false)
Result:
+-------+----------+------+-------+-------+
|eventId|timestamp |itemId|attrib1|attrib2|
+-------+----------+------+-------+-------+
|1 |10/01/2021|1 |abc |null |
|2 |11/01/2021|1 |abc |bbb |
|3 |12/01/2021|1 |ccc |bbb |
|4 |13/01/2021|1 |ccc |ddd |
|5 |10/01/2021|2 |eee |fff |
|6 |11/01/2021|2 |eee |fff |
|7 |12/01/2021|2 |eee |fff |
+-------+----------+------+-------+-------+
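One caveat: timestamp in the sample data is a string in dd/MM/yyyy form, so the window above orders lexically. That happens to work for these rows, but for real data it is safer to order on a parsed date, for example (the format string is assumed from the sample):

import org.apache.spark.sql.functions.to_date

val window = Window.partitionBy($"itemId").orderBy(to_date($"timestamp", "dd/MM/yyyy"))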

How to add a Seq[T] column to a Dataset that contains elements of two Datasets?

I have two Datasets, AccountData and CustomerData, with the corresponding case classes:
case class AccountData(customerId: String, accountId: String, balance: Long)
+----------+---------+-------+
|customerId|accountId|balance|
+----------+---------+-------+
|   IND0002|  ACC0002|    200|
|   IND0002|  ACC0022|    300|
|   IND0003|  ACC0003|    400|
+----------+---------+-------+
case class CustomerData(customerId: String, forename: String, surname: String)
+----------+-----------+-------+
|customerId|   forename|surname|
+----------+-----------+-------+
|   IND0001|Christopher|  Black|
|   IND0002|  Madeleine|   Kerr|
|   IND0003|      Sarah|Skinner|
+----------+-----------+-------+
How do I derive the following Dataset, which adds a column accounts containing the Seq[AccountData] for each customerId?
+----------+-----------+-------+---------------------------------------------+
|customerId|forename   |surname|accounts                                     |
+----------+-----------+-------+---------------------------------------------+
|IND0001   |Christopher|Black  |[]                                           |
|IND0002   |Madeleine  |Kerr   |[[IND0002,ACC0002,200],[IND0002,ACC0022,300]]|
|IND0003   |Sarah      |Skinner|[[IND0003,ACC0003,400]]                      |
+----------+-----------+-------+---------------------------------------------+
I've tried:
val joinCustomerAndAccount = accountDS.joinWith(customerDS, customerDS("customerId") === accountDS("customerId")).drop(col("_2"))
which gives me the following Dataframe:
+---------------------+
|_1 |
+---------------------+
|[IND0002,ACC0002,200]|
|[IND0002,ACC0022,300]|
|[IND0003,ACC0003,400]|
+---------------------+
If I then do:
val result = customerDS.withColumn("accounts", joinCustomerAndAccount("_1")(0))
I get the following Exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Field name should be String Literal, but it's 0;
Accounts can be grouped by "customerId" and joined with Customers:
import org.apache.spark.sql.functions.{collect_set, struct}
import spark.implicits._

// data
val accountDS = Seq(
  AccountData("IND0002", "ACC0002", 200),
  AccountData("IND0002", "ACC0022", 300),
  AccountData("IND0003", "ACC0003", 400)
).toDS()

val customerDS = Seq(
  CustomerData("IND0001", "Christopher", "Black"),
  CustomerData("IND0002", "Madeleine", "Kerr"),
  CustomerData("IND0003", "Sarah", "Skinner")
).toDS()

// action
val accountsGroupedDF = accountDS.toDF
  .groupBy("customerId")
  .agg(collect_set(struct("accountId", "balance")).as("accounts"))

val result = customerDS.toDF.alias("c")
  .join(accountsGroupedDF.alias("a"), $"c.customerId" === $"a.customerId", "left")
  .select("c.*", "accounts")

result.show(false)
Output:
+----------+-----------+-------+--------------------------------+
|customerId|forename |surname|accounts |
+----------+-----------+-------+--------------------------------+
|IND0001 |Christopher|Black |null |
|IND0002 |Madeleine |Kerr |[[ACC0002, 200], [ACC0022, 300]]|
|IND0003 |Sarah |Skinner|[[ACC0003, 400]] |
+----------+-----------+-------+--------------------------------+
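If you really need a typed Dataset with a Seq[AccountData] per customer and an empty Seq instead of null (as the question's expected output suggests), one possible sketch is a left joinWith followed by groupByKey and mapGroups; the CustomerWithAccounts case class is made up here purely for illustration:

case class CustomerWithAccounts(customerId: String, forename: String,
                                surname: String, accounts: Seq[AccountData])

val typedResult = customerDS
  .joinWith(accountDS, customerDS("customerId") === accountDS("customerId"), "left_outer")
  .groupByKey(_._1)
  .mapGroups { (customer, rows) =>
    CustomerWithAccounts(
      customer.customerId,
      customer.forename,
      customer.surname,
      // the right side of the pair is null for customers without accounts
      rows.flatMap(pair => Option(pair._2)).toSeq)
  }

typedResult.show(false)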

How to compare two columns data in Spark Dataframes using Scala

I want to compare two columns in a Spark DataFrame: if the value of a column (attr_value) is found in values of another (attr_valuelist) I want only that value to be kept. Otherwise, the column value should be null.
For example, given the following input
id1 id2 attrname attr_value attr_valuelist
1 2 test Yes Yes, No
2 1 test1 No Yes, No
3 2 test2 value1 val1, Value1,value2
I would expect the following output
id1 id2 attrname attr_value attr_valuelist
1 2 test Yes Yes
2 1 test1 No No
3 2 test2 value1 Value1
I assume, given your sample input, that the column with the search item contains a string while the search target is a sequence of strings. Also, I assume you're interested in case-insensitive search.
This is going to be the input (I added a row that would yield a null, to test the behavior of the UDF I wrote):
+---+---+--------+----------+----------------------+
|id1|id2|attrname|attr_value|attr_valuelist |
+---+---+--------+----------+----------------------+
|1 |2 |test |Yes |[Yes, No] |
|2 |1 |test1 |No |[Yes, No] |
|3 |2 |test2 |value1 |[val1, Value1, value2]|
|3 |2 |test2 |value1 |[val1, value2] |
+---+---+--------+----------+----------------------+
You can solve your problem with a very simple UDF.
import org.apache.spark.sql.functions.udf

val find = udf {
  (item: String, collection: Seq[String]) =>
    collection.find(_.toLowerCase == item.toLowerCase)
}
val df = spark.createDataFrame(Seq(
(1, 2, "test", "Yes", Seq("Yes", "No")),
(2, 1, "test1", "No", Seq("Yes", "No")),
(3, 2, "test2", "value1", Seq("val1", "Value1", "value2")),
(3, 2, "test2", "value1", Seq("val1", "value2"))
)).toDF("id1", "id2", "attrname", "attr_value", "attr_valuelist")
df.select(
$"id1", $"id2", $"attrname", $"attr_value",
find($"attr_value", $"attr_valuelist") as "attr_valuelist")
Showing the output of the last command yields the following:
+---+---+--------+----------+--------------+
|id1|id2|attrname|attr_value|attr_valuelist|
+---+---+--------+----------+--------------+
| 1| 2| test| Yes| Yes|
| 2| 1| test1| No| No|
| 3| 2| test2| value1| Value1|
| 3| 2| test2| value1| null|
+---+---+--------+----------+--------------+
You can execute this code in any spark-shell. If you are using it from a job you submit to a cluster, remember to import spark.implicits._.
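As a side note, on Spark 2.4 or later the same case-insensitive lookup can be expressed without a UDF, using higher-order functions in a SQL expression. A hedged sketch, assuming attr_valuelist is an array column as above:

import org.apache.spark.sql.functions.expr

df.withColumn(
  "attr_valuelist",
  // keep the matching element if there is one, otherwise fall back to null
  expr("""case when exists(attr_valuelist, x -> lower(x) = lower(attr_value))
         |then element_at(filter(attr_valuelist, x -> lower(x) = lower(attr_value)), 1) end""".stripMargin))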
You can also try this code. I think it will work using SQL with a CASE WHEN and a LIKE-based contains check.
import org.apache.spark.sql.Row

val emptyRDD = sc.emptyRDD[Row]
var emptyDataframe = sqlContext.createDataFrame(emptyRDD, your_dataframe.schema)

your_dataframe.createOrReplaceTempView("tbl")

// note: this assumes attr_valuelist is a string column holding comma-separated values
emptyDataframe = sqlContext.sql(
  """select id1, id2, attrname, attr_value,
    |       case when attr_valuelist like concat('%', attr_value, '%') then attr_value
    |            else null
    |       end as attr_valuelist
    |from tbl""".stripMargin)

emptyDataframe.show

Find columns with different values

My dataframe has 120 columns. Suppose my dataframe has the structure below:
Id value1 value2 value3
a 10 1983 19
a 20 1983 20
a 10 1983 21
b 10 1984 1
b 10 1984 2
We can see here that for id a, value1 has different values (10, 20). I have to find the columns that have different values for a particular id. Is there a statistical or any other approach in Spark to solve this problem?
Expected output
id new_column
a value1,value3
b value3
The following code might be the start of an answer:
import org.apache.spark.sql.functions.countDistinct
import spark.implicits._

val result = log.select("Id", "value1", "value2", "value3")
  .groupBy('Id)
  .agg('Id, countDistinct('value1), countDistinct('value2), countDistinct('value3))
It should do the following:
1)
log.select("Id", "value1", "value2", "value3")
selects the relevant columns (if you want to take all columns, this step might be redundant)
2)
groupBy('Id)
groups rows with the same Id
3)
agg('Id, countDistinct('value1), countDistinct('value2), countDistinct('value3))
outputs the Id and the number (count) of unique (distinct) values per Id for each specific column
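If you want to go all the way to the expected output (a comma-separated list, per Id, of the columns whose distinct count is greater than 1), one possible continuation is sketched below; the column list and names are taken from the question:

import org.apache.spark.sql.functions.{col, concat_ws, countDistinct, lit, when}

val valueCols = Seq("value1", "value2", "value3")
val aggExprs = valueCols.map(c => countDistinct(col(c)).as(c))

val counts = log.groupBy("Id").agg(aggExprs.head, aggExprs.tail: _*)

// keep only the names of columns with more than one distinct value per Id;
// concat_ws skips the nulls produced by when(...) for the other columns
val newColumn = counts.select(
  col("Id"),
  concat_ws(",", valueCols.map(c => when(col(c) > 1, lit(c))): _*).as("new_column"))

newColumn.show(false)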
You can do it in several ways, one of them being the distinct method, which is similar to the SQL behaviour. Another one is the groupBy method, where you pass as parameters the names of the columns you want to group by (e.g. df.groupBy("Id", "value1")); a sketch of that is shown after the example below.
Below is an example using the distinct method.
scala> case class Person(name : String, age: Int)
defined class Person
scala> val persons = Seq(Person("test", 10), Person("test", 20), Person("test", 10)).toDF
persons: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> persons.show
+----+---+
|name|age|
+----+---+
|test| 10|
|test| 20|
|test| 10|
+----+---+
scala> persons.select("name", "age").distinct().show
+-----+---+
| name|age|
+-----+---+
| test| 10|
| test| 20|
+-----+---+
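And a quick sketch of the groupBy alternative mentioned above, on the same persons dataframe; grouping by both columns and counting gives the distinct pairs plus how often each one occurs:

scala> persons.groupBy("name", "age").count().show

For these rows that would show the pair (test, 10) with a count of 2 and (test, 20) with a count of 1.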