Scala - Selecting minimum values by group

Looking at my input data frame below, I want to select, for each month, the timeframe where Diff_from_50 is lowest. If there are ties in that value, it should fall back to AvgWindSpeed and pick whichever timeframe has the lowest wind speed.
What would be the best way to do this in Scala? I've been working with the following code, but when I group by Month I lose my other columns. I'm also not sure how to compare the temperature differences and then pick the row with the lowest wind speed when there is a tie.
Any suggestions/tips would be appreciated.
Current Code:
val oshdata = osh
  .select(
    col("TemperatureF"),
    col("Wind SpeedMPH"),
    concat(
      format_string("%02d", col("Month")), lit("/"),
      format_string("%02d", col("Day")), lit("/"),
      col("Year"), lit(" "), col("TimeCST")).as("Date"))
  .withColumn("TemperatureF", when(col("TemperatureF").equalTo(-9999), null).otherwise(col("TemperatureF")))
  .withColumn("Wind SpeedMPH", when(col("Wind SpeedMPH").equalTo(-9999), null).otherwise(col("Wind SpeedMPH")))
  .withColumn("WindSpeed", when($"Wind SpeedMPH" === "Calm", 0).otherwise($"Wind SpeedMPH"))
val ts = to_timestamp($"Date","MM/dd/yyyy hh:mm a")
val Oshmydata=oshdata.withColumn("ts",ts)
val OshgroupByWindow = Oshmydata
  .groupBy(window(col("ts"), "1 hour"))
  .agg(avg("TemperatureF").as("avgTemp"), avg("WindSpeed").as("AvgWindSpeed"))
  .select("window.start", "window.end", "avgTemp", "AvgWindSpeed")
val Oshdaily = OshgroupByWindow
  .withColumn("_tmp", split($"start", " "))
  .select(
    $"_tmp".getItem(0).as("Date"),
    date_format($"_tmp".getItem(1), "hh:mm:ss a").as("StartTime"),
    $"end", $"avgTemp", $"AvgWindSpeed")
  .withColumn("_tmp2", split($"end", " "))
  .select(
    $"Date", $"StartTime",
    date_format($"_tmp2".getItem(1), "hh:mm:ss a").as("EndTime"),
    $"avgTemp", $"AvgWindSpeed")
  .withColumn("Diff_From_50", abs($"avgTemp" - 50))
val OshfinalData = Oshdaily.select(col("*"),month(col("Date")).as("Month")).orderBy($"Month",$"StartTime")
OshfinalData.createOrReplaceTempView("oshView")
val testing = OshfinalData.select(col("*")).groupBy($"Month",$"StartTime").agg(avg($"avgTemp").as("avgTemp"),avg($"AvgWindSpeed").as("AvgWindSpeed"))
val withDiff = testing.withColumn("Diff_from_50",abs($"avgTemp"-50))
withDiff.select(col("*")).groupBy($"Month").agg(min("Diff_from_50")).show()
Input Data Frame:
+-----+-----------+------------------+------------------+-------------------+
|Month| StartTime| avgTemp| AvgWindSpeed| Diff_from_50|
+-----+-----------+------------------+------------------+-------------------+
| 1|01:00:00 AM|17.375469072164957| 8.336983230663929| 32.62453092783504|
| 1|01:00:00 PM| 23.70729813664597|10.294075601374567| 26.29270186335403|
| 1|02:00:00 AM| 17.17661058638331| 8.332715559474817| 32.823389413616695|
| 1|02:00:00 PM| 23.78028142954523|10.131929492774708| 26.21971857045477|
| 1|03:00:00 AM|16.979751170960192| 8.305847424684158| 33.02024882903981|
| 1|03:00:00 PM| 23.78028142954523|11.131929492774708| 26.21971857045477|
| 2|01:00:00 AM| 18.19221536796537| 8.104439935064937| 31.80778463203463|
| 2|01:00:00 PM|25.602093162953263|10.756156072520753| 24.397906837046737|
| 2|02:00:00 AM| 17.7650265755505| 8.142266514806375| 32.2349734244495|
| 2|02:00:00 PM|25.602093162953263|11.756156072520753| 24.397906837046737|
+-----+-----------+------------------+------------------+-------------------+
Expected output:
+-----+-----------+------------------+------------------+-------------------+
|Month| StartTime| avgTemp| AvgWindSpeed| Diff_from_50|
+-----+-----------+------------------+------------------+-------------------+
| 1|02:00:00 PM| 23.78028142954523|10.131929492774708| 26.21971857045477|
| 2|01:00:00 PM|25.602093162953263|10.756156072520753| 24.397906837046737|
+-----+-----------+------------------+------------------+-------------------+

You can use a window function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val monthsLowest = Window
  .partitionBy('Month)
  .orderBy('Diff_from_50.asc, 'AvgWindSpeed.asc)

df.withColumn("rn", row_number().over(monthsLowest))
  .where($"rn" === 1)
  .drop("rn")
  .show()
It will give you the same expected output.
For more information about window functions in Spark, there is a great guide: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions-windows.html
You can also take a look at this answer:
How to select the first row of each group?
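If you prefer to avoid a window function, an aggregation over a struct gives the same result, because structs compare field by field in declaration order. A minimal sketch, assuming the withDiff dataframe built in the question (columns Month, StartTime, avgTemp, AvgWindSpeed, Diff_from_50):
import org.apache.spark.sql.functions._

// min over a struct compares Diff_from_50 first, then AvgWindSpeed,
// so ties on Diff_from_50 are broken by the lower wind speed
val perMonth = withDiff
  .groupBy($"Month")
  .agg(min(struct($"Diff_from_50", $"AvgWindSpeed", $"StartTime", $"avgTemp")).as("best"))
  .select($"Month", $"best.StartTime", $"best.avgTemp", $"best.AvgWindSpeed", $"best.Diff_from_50")

perMonth.show()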

Related

Spark scala dynamically create filter condition value from sequence Pair

I have a Spark dataframe df:
|id | year | month |
-------------------
| 1 | 2020 | 01 |
| 2 | 2019 | 03 |
| 3 | 2020 | 01 |
I have a sequence year_month = Seq((2019, 1), (2020, 1), (2021, 1)).
The sequence gets generated dynamically each time the code runs.
I want to filter the dataframe df based on the year_month sequence, keeping rows where ($"year", $"month") matches one of the (year, month) pairs in the sequence.
You can achieve this by:
1. Creating a dataframe from year_month
2. Performing an inner join between it and your original dataframe on year and month
3. Choosing distinct records
The resulting dataframe contains the matched rows.
Working Example
Setup
import spark.implicits._
val dfData = Seq((1, 2020, 1), (2, 2019, 3), (3, 2020, 1))
val df = dfData.toDF()
  .selectExpr("_1 as id", "_2 as year", "_3 as month")
df.createOrReplaceTempView("original_data")
val year_month = Seq((2019,1),(2020,1),(2021,1))
Step 1
// Create Temporary DataFrame
val yearMonthDf = year_month.toDF()
.selectExpr("_1 as year","_2 as month" )
yearMonthDf.createOrReplaceTempView("temp_year_month")
Step 2
val dfResult = spark.sql("""
  select o.id, o.year, o.month
  from original_data o
  inner join temp_year_month t
    on o.year = t.year and o.month = t.month
""")
Step 3
val dfResultDistinct = dfResult.distinct()
Output
dfResultDistinct.show()
+---+----+-----+
| id|year|month|
+---+----+-----+
| 1|2020| 1|
| 3|2020| 1|
+---+----+-----+
NB: if you are interested in finding the matching records irrespective of the id, you could update the Spark SQL to the following (o.id removed from the select):
select
o.year,
o.month
from
original_data o
inner join
temp_year_month t on o.year = t.year and
o.month = t.month
which would give the following result:
+----+-----+
|year|month|
+----+-----+
|2020| 1|
+----+-----+
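If you would rather build a filter condition than join, the same sequence can be folded into a single Column predicate. A sketch against the df and year_month defined above (assumes the sequence is non-empty):
import org.apache.spark.sql.functions.{col, lit}

// OR together one (year AND month) predicate per pair in the sequence
val condition = year_month
  .map { case (y, m) => col("year") === lit(y) && col("month") === lit(m) }
  .reduce(_ || _)

df.filter(condition).show()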

How to update all the values of a column in a DataFrame

I have a data frame which has a non-formatted Date column:
+--------+-----------+--------+
|CDOPEINT| bbbbbbbbbb| Date|
+--------+-----------+--------+
| AAA|bbbbbbbbbbb|13190326|
| AAA|bbbbbbbbbbb|10190309|
| AAA|bbbbbbbbbbb|36190908|
| AAA|bbbbbbbbbbb|07190214|
| AAA|bbbbbbbbbbb|13190328|
| AAA|bbbbbbbbbbb|23190608|
| AAA|bbbbbbbbbbb|13190330|
| AAA|bbbbbbbbbbb|26190630|
+--------+-----------+--------+
The Date column is formatted as wwyyMMdd (week, year, month, day), which I want to convert to a proper date format; for that I have a method, format, that does this.
So my question is: how can I map all the values of the Date column to the needed format? Here is the output that I want:
+--------+-----------+----------+
|CDOPEINT| bbbbbbbbbb| Date|
+--------+-----------+----------+
| AAA|bbbbbbbbbbb|2019/03/26|
| AAA|bbbbbbbbbbb|2019/03/09|
| AAA|bbbbbbbbbbb|2019/09/08|
| AAA|bbbbbbbbbbb|2019/02/14|
| AAA|bbbbbbbbbbb|2019/03/28|
| AAA|bbbbbbbbbbb|2019/06/08|
| AAA|bbbbbbbbbbb|2019/03/30|
| AAA|bbbbbbbbbbb|2019/06/30|
+--------+-----------+----------+
With Spark 2.4.3, using unix_timestamp you can convert the data to the expected output.
scala> var df2 =spark.createDataFrame(Seq(("AAA","bbbbbbbbbbb","13190326"),("AAA","bbbbbbbbbbb","10190309"),("AAA","bbbbbbbbbbb","36190908"),("AAA","bbbbbbbbbbb","07190214"),("AAA","bbbbbbbbbbb","13190328"),("AAA","bbbbbbbbbbb","23190608"),("AAA","bbbbbbbbbbb","13190330"),("AAA","bbbbbbbbbbb","26190630"))).toDF("CDOPEINT","bbbbbbbbbb","Date")
scala> df2.withColumn("Date",from_unixtime(unix_timestamp(substring(col("Date"),3,7),"yyMMdd"),"yyyy/MM/dd")).show
+--------+-----------+----------+
|CDOPEINT| bbbbbbbbbb| Date|
+--------+-----------+----------+
| AAA|bbbbbbbbbbb|2019/03/26|
| AAA|bbbbbbbbbbb|2019/03/09|
| AAA|bbbbbbbbbbb|2019/09/08|
| AAA|bbbbbbbbbbb|2019/02/14|
| AAA|bbbbbbbbbbb|2019/03/28|
| AAA|bbbbbbbbbbb|2019/06/08|
| AAA|bbbbbbbbbbb|2019/03/30|
| AAA|bbbbbbbbbbb|2019/06/30|
+--------+-----------+----------+
Let me know if you have any query related to this.
If the dates all fall in the 2000s and the Date column in your original dataframe is of Integer type, you could try something like this:
def getDate = (date: Int) => {
  // drop the two leading week digits, then split the remaining yyMMdd into 2-character chunks
  val dateString = date.toString.drop(2).sliding(2, 2)
  dateString.zipWithIndex.map {
    case (value, index) => if (index == 0) "20" + value else value
  }.mkString("/")
}
Then create a UDF which calls this function
val updateDateUdf = udf(getDate)
If originalDF is the original Dataframe that you have, you could then change the dataframe like this
val updatedDF = originalDF.withColumn("Date",updateDateUdf(col("Date")))
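As an aside, on recent Spark versions the same conversion can be written without a UDF, using to_date and date_format. A sketch against the df2 defined earlier (string Date column); the result should match the unix_timestamp approach:
import org.apache.spark.sql.functions.{col, date_format, substring, to_date}

// drop the two leading week digits, parse the remaining yyMMdd, then re-render as yyyy/MM/dd
df2.withColumn("Date",
    date_format(to_date(substring(col("Date"), 3, 6), "yyMMdd"), "yyyy/MM/dd"))
  .show()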

Is there a better way to go about this process of trimming my spark DataFrame appropriately?

In the following example, I want to take only the x Ids with the highest counts, where x is the number I want, determined by a variable called howMany.
For the following example, given this Dataframe:
+------+--+-----+
|query |Id|count|
+------+--+-----+
|query1|11|2 |
|query1|12|1 |
|query2|13|2 |
|query2|14|1 |
|query3|13|2 |
|query4|12|1 |
|query4|11|1 |
|query5|12|1 |
|query5|11|2 |
|query5|14|1 |
|query5|13|3 |
|query6|15|2 |
|query6|16|1 |
|query7|17|1 |
|query8|18|2 |
|query8|13|3 |
|query8|12|1 |
+------+--+-----+
I would like to get the following dataframe if the variable howMany is 2.
+------+-------+-----+
|query |Ids |count|
+------+-------+-----+
|query1|[11,12]|2 |
|query2|[13,14]|2 |
|query3|[13] |2 |
|query4|[12,11]|1 |
|query5|[11,13]|2 |
|query6|[15,16]|2 |
|query7|[17] |1 |
|query8|[18,13]|2 |
+------+-------+-----+
I then want to remove the count column, but that is trivial.
I have a way to do this, but I think it defeats the purpose of Scala altogether and wastes a lot of runtime. Being new, I am unsure about the best way to go about this.
My current method is to first get a distinct list of the query column and create an iterator. Second, I loop through the list using the iterator and trim the dataframe to only the current query using df.select($"eachColumnName"...).where($"query".equalTo(iter.next())). I then .limit(howMany) and then groupBy($"query").agg(collect_list($"Id").as("Ids")). Lastly, I start from an empty dataframe, union each of these into it one by one, and return the newly created dataframe.
df.select($"query").distinct().rdd.map(r => r(0).asInstanceOf[String]).collect().toList
val iter = queries.toIterator
while (iter.hasNext) {
middleDF = df.select($"query", $"Id", $"count").where($"query".equalTo(iter.next()))
queryDF = middleDF.sort(col("count").desc).limit(howMany).select(col("query"), col("Ids")).groupBy(col("query")).agg(collect_list("Id").as("Ids"))
emptyDF.union(queryDF) // Assuming emptyDF is made
}
emptyDF
I would do this using window functions to get the rank, then groupBy to aggregate:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val howMany = 2
val newDF = df
.withColumn("rank",row_number().over(Window.partitionBy($"query").orderBy($"count".desc)))
.where($"rank"<=howMany)
.groupBy($"query")
.agg(
collect_list($"Id").as("Ids"),
max($"count").as("count")
)
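The asker notes that dropping the count column afterwards is trivial; on the result above it is a one-liner, for example:
newDF.drop("count").show(false)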

dataframe spark scala take the (MAX-MIN) for each group

I have a dataframe from a processing step that looks like:
+---------+------+-----------+
|Time |group |value |
+---------+------+-----------+
| 28371| 94| 906|
| 28372| 94| 864|
| 28373| 94| 682|
| 28374| 94| 574|
| 28383| 95| 630|
| 28384| 95| 716|
| 28385| 95| 913|
I would like to take (the value at the max time minus the value at the min time) for each group, to get this result:
+------+-----------+
|group | value |
+------+-----------+
| 94| -332|
| 95| 283|
Thank you in advance for the help
df.groupBy("groupCol").agg(max("value")-min("value"))
Based on the question edit by the OP, here is a way to do this in PySpark. The idea is to compute the row numbers in ascending and descending order of time per group and use those values for subtraction.
from pyspark.sql import Window
from pyspark.sql import functions as func
w_asc = Window.partitionBy(df.groupCol).orderBy(df.time)
w_desc = Window.partitionBy(df.groupCol).orderBy(func.desc(df.time))
df = df.withColumn('rnum_asc', func.row_number().over(w_asc)) \
       .withColumn('rnum_desc', func.row_number().over(w_desc))
df.groupBy(df.groupCol) \
  .agg((func.max(func.when(df.rnum_desc == 1, df.value)) - func.max(func.when(df.rnum_asc == 1, df.value))).alias('diff')).show()
It would have been easier if the window function first_value were available in Spark SQL. A generic way to solve this using SQL is:
select distinct groupCol, diff
from (
  select t.*,
         first_value(val) over (partition by groupCol order by time) -
         first_value(val) over (partition by groupCol order by time desc) as diff
  from tbl t
) t
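Since the rest of this thread is in Scala, here is a sketch of the same row-number idea in Scala, assuming the columns are named Time, group and value as in the question:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val wAsc = Window.partitionBy($"group").orderBy($"Time".asc)
val wDesc = Window.partitionBy($"group").orderBy($"Time".desc)

// value at the latest Time minus value at the earliest Time, per group
df.withColumn("rnAsc", row_number().over(wAsc))
  .withColumn("rnDesc", row_number().over(wDesc))
  .groupBy($"group")
  .agg((max(when($"rnDesc" === 1, $"value")) - max(when($"rnAsc" === 1, $"value"))).as("value"))
  .show()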

Generating all possible combinations from a Data Frame in Apache Spark

I'm trying to do something quite simple where I have 2 arrays that have been converted into a Data Frame, and I want to show all possible combinations. So for example my output at the moment looks something like this:
+-----------+-----------+
| A | B |
+-----------+-----------+
| First | T |
| Second | P |
+-----------+-----------+
However what I'm actually looking for is this:
+-----------+-----------+
| A | B |
+-----------+-----------+
| First | T |
| First | P |
| Second | T |
| Second | P |
+-----------+-----------+
So far I've got some fairly straightforward code to map my arrays into columns, but being quite new to both Scala and Spark, I'm not sure how I'd generate all those combinations. Here is what I have so far:
val firstColumnValues = Array("First", "Second")
val secondColumnValues = Array("T", "P")
val xs = Array(firstColumnValues, secondColumnValues).transpose
val mapped = sparkContext.parallelize(xs).map(ys => Row(ys(0), ys(1)))
val df = mapped.toDF("A", "B")
df.show
...
case class Row(first: String, second: String)
Thanks in advance for any help
In Spark 2.3:
import spark.implicits._

val firstColumnValues = sc.parallelize(Array("First", "Second")).toDF("A")
val secondColumnValues = sc.parallelize(Array("T", "P")).toDF("B")
val fullouter = firstColumnValues.crossJoin(secondColumnValues)
fullouter.show()
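A quick note on the result: crossJoin pairs every row on the left with every row on the right, so this produces 2 × 2 = 4 rows, matching the expected output in the question. If you are starting from local collections anyway, calling toDF directly on a Seq avoids the explicit parallelize, for example:
val a = Seq("First", "Second").toDF("A")
val b = Seq("T", "P").toDF("B")
a.crossJoin(b).show()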