Spark - Select sum and all columns of joined dataset - scala

I have 2 tables: Employees (Id, Name) and EmployeeSalary (EmployeeId, Designation, Salary). One employee can hold multiple designations in the company and therefore have multiple salaries. How do I get EmployeeId, Name, the sum of salaries, and a Seq of all designations?
What I have tried so far is:
employeeDS.join(employeeSalaryDS, employeeDS.col("Id")
    .equalTo(employeeSalaryDS.col("EmployeeId")), "left_outer")
  .groupBy(employeeDS.col("Id"))
  .agg(sum("Salary") as "Sum of salaries")

Something like this
scala> val dfe = Seq((101,"John"),(102,"Mike"), (103,"Paul"), (104,"Tom")).toDF("id","name")
dfe: org.apache.spark.sql.DataFrame = [id: int, name: string]
scala> val dfes = Seq((101,"Dev", 4000),(102,"Designer", 4000),(102,"Architect", 5000), (103,"Designer",6000), (104,"Consultant",8000), (104,"Supervisor",9000), (104,"PM",10000) ).toDF("id","desig","salary")
dfes: org.apache.spark.sql.DataFrame = [id: int, desig: string ... 1 more field]
scala> dfe.join(dfes, dfe.col("id").equalTo(dfes.col("id")),"left_outer").groupBy(dfe.col("Id")).agg(sum("Salary") as "Sum of salaries", collect_list('desig as "desig_list")).show(false)
+---+---------------+-----------------------------------+
|Id |Sum of salaries|collect_list(desig AS `desig_list`)|
+---+---------------+-----------------------------------+
|101|4000           |[Dev]                              |
|103|6000           |[Designer]                         |
|102|9000           |[Architect, Designer]              |
|104|27000          |[PM, Supervisor, Consultant]       |
+---+---------------+-----------------------------------+
scala>
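The question also asks for the employee Name and a Seq of all designations. A minimal variation of the same idea (a sketch, assuming the dfe/dfes frames above) that carries the name through the groupBy and keeps the aliases outside the aggregates:

import org.apache.spark.sql.functions.{sum, collect_list}

dfe.join(dfes, dfe("id") === dfes("id"), "left_outer")
  .groupBy(dfe("id"), dfe("name"))                      // keep the name alongside the id
  .agg(sum("salary").as("sum_of_salaries"),             // total salary per employee
       collect_list("desig").as("desig_list"))          // all designations per employee
  .show(false)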

Related

Serialize table to nested JSON using Apache Spark SQL

The question is the same as this one but can this:
df.withColumn("VEHICLE", struct("VEHICLENUMBER", "CUSTOMERID")).
  select("VEHICLE", "ACCOUNTNO").              // only select required columns
  groupBy("ACCOUNTNO").
  agg(collect_list("VEHICLE").as("VEHICLE")).  // for the same group, create a list of vehicles
  toJSON.                                      // convert to JSON
  show(false)
be rewritten in pure SQL? I mean something like this:
val sqlDF = spark.sql("SELECT VEHICLE, ACCOUNTNO as collect_list(ACCOUNTNO) FROM VEHICLES group by ACCOUNTNO")
sqlDF.show()
Is that possible?
The SQL equivalent of your dataframe example would be:
scala> val df = Seq((10003014,"MH43AJ411",20000000),
| (10003014,"MH43AJ411",20000001),
| (10003015,"MH12GZ3392",20000002)
| ).toDF("ACCOUNTNO","VEHICLENUMBER","CUSTOMERID").withColumn("VEHICLE",struct("VEHICLENUMBER","CUSTOMERID"))
df: org.apache.spark.sql.DataFrame = [ACCOUNTNO: int, VEHICLENUMBER: string ... 2 more fields]
scala> df.registerTempTable("vehicles")
scala> val sqlDF = spark.sql("SELECT ACCOUNTNO, collect_list(VEHICLE) as ACCOUNT_LIST FROM VEHICLES group by ACCOUNTNO").toJSON
sqlDF: org.apache.spark.sql.Dataset[String] = [value: string]
scala> sqlDF.show(false)
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|{"ACCOUNTNO":10003014,"ACCOUNT_LIST":[{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000000},{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000001}]}|
|{"ACCOUNTNO":10003015,"ACCOUNT_LIST":[{"VEHICLENUMBER":"MH12GZ3392","CUSTOMERID":20000002}]} |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
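As a side note, registerTempTable is deprecated since Spark 2.0; the same flow works with createOrReplaceTempView (a sketch using the same df as above):

df.createOrReplaceTempView("vehicles")
val sqlDF = spark.sql("SELECT ACCOUNTNO, collect_list(VEHICLE) AS ACCOUNT_LIST FROM vehicles GROUP BY ACCOUNTNO").toJSON
sqlDF.show(false)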

Apache Spark SQL query and DataFrame as reference data

I have two Spark DataFrames:
cities DataFrame with the following column:
city
-----
London
Austin
bigCities DataFrame with the following column:
name
------
London
Cairo
I need to transform the cities DataFrame and add an additional Boolean column, bigCity. The value of this column must be calculated based on the condition "cities.city IN bigCities.name".
I can do this in the following way (with a static bigCities collection):
cities.createOrReplaceTempView("cities")
var resultDf = spark.sql("SELECT city, CASE WHEN city IN ('London', 'Cairo') THEN 'Y' ELSE 'N' END AS bigCity FROM cities")
but I don't know how to replace the static collection ('London', 'Cairo') with the bigCities DataFrame in the query. I want to use bigCities as reference data in the query.
Please advise how to achieve this.
A left outer join against bigCities, followed by a null check, does it:
val df = cities.join(bigCities, $"name".equalTo($"city"), "leftouter").
  withColumn("bigCity", when($"name".isNull, "N").otherwise("Y")).
  drop("name")
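If bigCities is small reference data (the usual case), you can additionally hint Spark to broadcast it so the join avoids a shuffle. A sketch of the same logic with the hint, assuming the frames above:

import org.apache.spark.sql.functions.{broadcast, col, when}

val result = cities.join(broadcast(bigCities), col("name") === col("city"), "leftouter")
  .withColumn("bigCity", when(col("name").isNull, "N").otherwise("Y"))  // matched => big city
  .drop("name")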
You can also use collect_list() on the bigCities table. Check this out:
scala> val df_city = Seq(("London"),("Austin")).toDF("city")
df_city: org.apache.spark.sql.DataFrame = [city: string]
scala> val df_bigCities = Seq(("London"),("Cairo")).toDF("name")
df_bigCities: org.apache.spark.sql.DataFrame = [name: string]
scala> df_city.createOrReplaceTempView("cities")
scala> df_bigCities.createOrReplaceTempView("bigCities")
scala> spark.sql(" select city, case when array_contains((select collect_list(name) from bigcities),city) then 'Y' else 'N' end as bigCity from cities").show(false)
+------+-------+
|city |bigCity|
+------+-------+
|London|Y      |
|Austin|N      |
+------+-------+
scala>
If the dataset is big, you can use collect_set, which will be more efficient.
scala> spark.sql(" select city, case when array_contains((select collect_set(name) from bigcities),city) then 'Y' else 'N' end as bigCity from cities").show(false)
+------+-------+
|city |bigCity|
+------+-------+
|London|Y      |
|Austin|N      |
+------+-------+
scala>
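The same check can also stay in the DataFrame API instead of SQL. A sketch, assuming the df_city/df_bigCities frames above and that the reference list fits on the driver (the same assumption the collect_list subquery makes):

import org.apache.spark.sql.functions.{col, when}

// collect the reference names once, then test membership with isin
val bigCityNames = df_bigCities.select("name").collect().map(_.getString(0))
df_city.withColumn("bigCity", when(col("city").isin(bigCityNames: _*), "Y").otherwise("N")).show(false)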

Selecting several columns from spark dataframe with a list of columns as a start

Assuming that I have a list of Spark columns and a Spark dataframe df, what is the appropriate snippet of code to select a sub-dataframe containing only the columns in the list?
Something similar to maybe:
var needed_columns: List[Column] = List[Column](new Column("a"), new Column("b"))
df(needed_columns)
I wanted to get the column names and then select them using the following line of code.
Unfortunately, the column name seems to be write-only.
df.select(needed_columns.head.as(String),needed_columns.tail: _*)
Your needed_columns is of type List[Column], hence you can simply use needed_columns: _* as the arguments for select:
val df = Seq((1, "x", 10.0), (2, "y", 20.0)).toDF("a", "b", "c")
import org.apache.spark.sql.Column
val needed_columns: List[Column] = List(new Column("a"), new Column("b"))
df.select(needed_columns: _*)
// +---+---+
// | a| b|
// +---+---+
// | 1| x|
// | 2| y|
// +---+---+
Note that select takes two types of arguments:
def select(cols: Column*): DataFrame
def select(col: String, cols: String*): DataFrame
If you have a list of column names of String type, you can use the latter select:
val needed_col_names: List[String] = List("a", "b")
df.select(needed_col_names.head, needed_col_names.tail: _*)
Or, you can map the list of Strings to Columns to use the former select (note that col comes from org.apache.spark.sql.functions):
import org.apache.spark.sql.functions.col
df.select(needed_col_names.map(col): _*)
I understand that you want to select only certain columns from a separate list rather than all of the dataframe's columns. Below is an example where I select the first name and last name using a separate list. Check this out:
scala> val df = Seq((101,"Jack", "wright" , 27, "01976", "US")).toDF("id","fname","lname","age","zip","country")
df: org.apache.spark.sql.DataFrame = [id: int, fname: string ... 4 more fields]
scala> df.columns
res20: Array[String] = Array(id, fname, lname, age, zip, country)
scala> val needed =Seq("fname","lname")
needed: Seq[String] = List(fname, lname)
scala> val needed_df = needed.map( x=> col(x) )
needed_df: Seq[org.apache.spark.sql.Column] = List(fname, lname)
scala> df.select(needed_df:_*).show(false)
+-----+------+
|fname|lname |
+-----+------+
|Jack |wright|
+-----+------+
scala>

Sample a different number of random rows for every group in a dataframe in spark scala

The goal is to sample (without replacement) a different number of rows in a dataframe for every group. The number of rows to sample for a specific group is in another dataframe.
Example: idDF is the dataframe to sample from. The groups are denoted by the ID column. The dataframe planDF specifies the number of rows to sample for each group, where "datesToUse" denotes the number of rows and "ID" denotes the group. "totalDates" is the total number of rows for that group and may or may not be useful.
The final result should have 3 rows sampled from the first group (ID 1), 2 rows sampled from the second group (ID 2) and 1 row sampled from the third group (ID 3).
val idDF = Seq(
  (1, "2017-10-03"),
  (1, "2017-10-22"),
  (1, "2017-11-01"),
  (1, "2017-10-02"),
  (1, "2017-10-09"),
  (1, "2017-12-24"),
  (1, "2017-10-20"),
  (2, "2017-11-17"),
  (2, "2017-11-12"),
  (2, "2017-12-02"),
  (2, "2017-10-03"),
  (3, "2017-12-18"),
  (3, "2017-11-21"),
  (3, "2017-12-13"),
  (3, "2017-10-08"),
  (3, "2017-10-16"),
  (3, "2017-12-04")
).toDF("ID", "date")

val planDF = Seq(
  (1, 3, 7),
  (2, 2, 4),
  (3, 1, 6)
).toDF("ID", "datesToUse", "totalDates")
This is an example of what the resulting dataframe should look like:
+---+----------+
| ID|      date|
+---+----------+
|  1|2017-10-22|
|  1|2017-11-01|
|  1|2017-10-20|
|  2|2017-11-12|
|  2|2017-10-03|
|  3|2017-10-16|
+---+----------+
So far, I tried to use the sample method for DataFrame: https://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/sql/DataFrame.html
Here is an example that would work for an entire data frame.
def sampleDF(DF: DataFrame, datesToUse: Int, totalDates: Int): DataFrame = {
  val fraction = datesToUse / totalDates.toFloat.toDouble
  DF.sample(false, fraction)
}
I can't figure out how to use something like this for each group. I tried joining the planDF table to the idDF table and using a window partition.
Another idea I had was to somehow make a new column randomly labeled true/false and then filter on that column.
Another option, staying entirely in DataFrames, would be to compute probabilities using your planDF, join with idDF, append a column of random numbers, and then filter. Helpfully, sql.functions has a rand function.
import org.apache.spark.sql.functions._
import spark.implicits._
val probabilities = planDF.withColumn("prob", $"datesToUse" / $"totalDates")
val dfWithProbs = idDF.join(probabilities, Seq("ID"))
  .withColumn("rand", rand())
  .where($"rand" < $"prob")
(You'll want to double-check that the division isn't integer division.)
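If you only want the original idDF columns in the result, a small follow-up (assuming the dfWithProbs frame above) is to project them back out:

import org.apache.spark.sql.functions.col

val sampled = dfWithProbs.select(idDF.columns.map(col): _*)  // drop prob, rand and the plan columns
sampled.show()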
With the assumption that your planDF is small enough to be collected, you can use Scala's foldLeft to traverse the id list and accumulate the sample Dataframes per id:
import org.apache.spark.sql.{Row, DataFrame}

def sampleByIdDF(DF: DataFrame, id: Int, datesToUse: Int, totalDates: Int): DataFrame = {
  val fraction = datesToUse.toDouble / totalDates
  DF.where($"id" === id).sample(false, fraction)
}

val emptyDF = Seq.empty[(Int, String)].toDF("ID", "date")

val planList = planDF.rdd.collect.map{ case Row(x: Int, y: Int, z: Int) => (x, y, z) }
// planList: Array[(Int, Int, Int)] = Array((1,3,7), (2,2,4), (3,1,6))

planList.foldLeft( emptyDF ){
  case (accDF: DataFrame, (id: Int, num: Int, total: Int)) =>
    accDF union sampleByIdDF(idDF, id, num, total)
}
// res1: org.apache.spark.sql.DataFrame = [ID: int, date: string]
// res1.show
// +---+----------+
// | ID|      date|
// +---+----------+
// |  1|2017-10-03|
// |  1|2017-11-01|
// |  1|2017-10-02|
// |  1|2017-12-24|
// |  1|2017-10-20|
// |  2|2017-11-17|
// |  2|2017-11-12|
// |  2|2017-12-02|
// |  3|2017-11-21|
// |  3|2017-12-13|
// +---+----------+
Note that the sample() method does not necessarily generate the exact number of samples specified in its arguments. Here's a relevant SO Q&A.
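If you need exactly datesToUse rows per group rather than an expected fraction, one workaround (a sketch, assuming the same idDF and planDF) is to order each group randomly with a window and keep the top N rows:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rand, row_number}

val w = Window.partitionBy("ID").orderBy(rand())          // random order within each ID
val exactSample = idDF.join(planDF, Seq("ID"))
  .withColumn("rn", row_number().over(w))
  .where(col("rn") <= col("datesToUse"))                  // keep the first datesToUse rows per ID
  .select("ID", "date")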
If your planDF is large, you might have to consider using RDD's aggregate, which has the following signature (skipping the implicit argument):
def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U): U
It works somewhat like foldLeft, except that it has one accumulation operator within a partition and an additional one to combine results from different partitions.
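For intuition, here is a minimal sketch of aggregate on a plain RDD (assuming an existing SparkContext sc), computing a sum and a count in a single pass:

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
val (total, count) = nums.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),   // seqOp: fold one element into the partition-local accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)    // combOp: merge accumulators from different partitions
)
// total = 15, count = 5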

Spark dataframe explode pair of lists

My dataframe has 2 columns which look like this:
col_id       | col_name
-------------|----------------
[id1, id2]   | [name1, name2]
[id3, id4]   | [name3, name4]
...
So for each row, there are two matching arrays of the same length in the id and name columns. What I want is to get each id/name pair as a separate row, like:
col_id | col_name
-------|---------
id1    | name1
id2    | name2
...
explode seems like the function to use, but I can't seem to get it to work. What I tried is:
rdd.explode(col("col_id"), col("col_name")) ({
  case row: Row =>
    val ids: java.util.List[String] = row.getList(0)
    val names: java.util.List[String] = row.getList(1)
    var res: Array[(String, String)] = new Array[(String, String)](ids.size)
    for (i <- 0 until ids.size) {
      res :+ (ids.get(i), names.get(i))
    }
    res
})
This, however, returns only nulls, so it might just be my poor knowledge of Scala. Can anyone point out the issue?
Looks like the last 10 minutes out of the past 1-2 hours did the trick. This works just fine (the original version returned only nulls because res :+ (...) builds a new array instead of appending to res in place):
import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

df.explode(col("id"), col("name")) ({
  case row: Row =>
    val ids: List[String] = row.getList(0).asScala.toList
    val names: List[String] = row.getList(1).asScala.toList
    ids zip names
})
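Side note: DataFrame.explode with a function argument is deprecated in Spark 2.x. A sketch of an alternative, assuming id and name are array columns as above: zip them with a UDF and use the explode function instead.

import org.apache.spark.sql.functions.{col, explode, udf}

// zip the two arrays into an array of (id, name) pairs
val zipArrays = udf((ids: Seq[String], names: Seq[String]) => ids.zip(names))

df.withColumn("pair", explode(zipArrays(col("id"), col("name"))))
  .select(col("pair._1").as("col_id"), col("pair._2").as("col_name"))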