Find minimum and maximum of year and month in Spark Scala

I would like to find the minimum and maximum of year and month from a Spark dataframe. Below is my dataframe:
code year month
xx 2004 1
xx 2004 2
xxx 2004 3
xx 2004 6
xx 2011 12
xx 2018 10
I want the minimum year and month as 2004-1 and the maximum year and month as 2018-10.
The solution I tried is:
val minAnMaxYearAndMonth = dataSet.agg(min(Year),max(Month)).head()
val minYear = minAnMaxYearAndMonth(0)
val maxYear = minAnMaxYearAndMonth(1)
val minMonth = dataSet.select(Month).where(col(Year) === minYear).take(1)
val maxMonth = dataSet.select(Month).where(col(Year) === maxYear).take(1)
I am getting minYear and maxYear, but not the min and max month. Please help.

You could use struct to make tuples out of years and months and then rely on tuple ordering. Tuples are ordered primarily by the leftmost component, with the next component used as a tie-breaker.
df.select(struct("year", "month") as "ym")
  .agg(min("ym") as "min", max("ym") as "max")
  .selectExpr("stack(2, 'min', min.*, 'max', max.*) as (agg, year, month)")
  .show()
Output:
+---+----+-----+
|agg|year|month|
+---+----+-----+
|min|2004|    1|
|max|2018|   10|
+---+----+-----+
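If you only need the two labels from the question (2004-1 and 2018-10), a minimal sketch building on the same struct trick could pull the min/max structs to the driver and format them. This assumes year and month are integer columns:
import org.apache.spark.sql.functions.{col, max, min, struct}

val row = df
  .select(struct(col("year"), col("month")).as("ym"))
  .agg(min(col("ym")).as("min"), max(col("ym")).as("max"))
  .head()

val minYm = row.getStruct(0)
val maxYm = row.getStruct(1)
val minLabel = s"${minYm.getInt(0)}-${minYm.getInt(1)}" // "2004-1" for the sample data
val maxLabel = s"${maxYm.getInt(0)}-${maxYm.getInt(1)}" // "2018-10" for the sample data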

Related

How to convert short date D-MMM-yyyy using PySpark

Why does only the January date work when I try to convert using the code below?
df2 = spark.createDataFrame([["05-Nov-2000"], ["02-Jan-2021"]], ["date"])
df2 = df2.withColumn("date", to_date(col("date"), "D-MMM-yyyy"))
display(df2)
Result:
Date
------------
undefined
2021-01-02
D is the day of the year.
The second row works because day 02 of the year does fall in January, but day 05 of the year does not fall in November, so the first row fails.
If you try:
data = [{"date": "05-Jan-2000"}, {"date": "02-Jan-2021"}]
it will work for both rows.
However, what you need is d, the day of the month, so use the pattern d-MMM-yyyy.
For further information please see: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
D is day-of-the-year.
What you're looking for is d - day of the month.
PySpark supports the Java DateTimeFormatter patterns: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/time/format/DateTimeFormatter.html
df2 = spark.createDataFrame([["05-Nov-2000"], ["02-Jan-2021"]], ["date"])
df2 = df2.withColumn("date", to_date(col("date"), "dd-MMM-yyyy"))
df2.show()
+----------+
|      date|
+----------+
|2000-11-05|
|2021-01-02|
+----------+
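For reference, the same d-versus-D behaviour can be reproduced in Spark Scala (the language of the main question above); a minimal sketch, assuming a SparkSession value named spark:
import org.apache.spark.sql.functions.{col, to_date}
import spark.implicits._ // assumes a SparkSession value named spark

val dates = Seq("05-Nov-2000", "02-Jan-2021").toDF("date")

// "d" (day-of-month) parses both rows; with "D" (day-of-year) the Nov row fails to
// parse (null or an error, depending on the Spark version and spark.sql.legacy.timeParserPolicy).
dates.withColumn("parsed", to_date(col("date"), "d-MMM-yyyy")).show()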

Spark : aggregate values by a list of given years

I'm new to Scala. Say I have a dataset:
>>> ds.show()
+----+---------------+-----------+
|year|nb_product_sold|system_year|
+----+---------------+-----------+
|2010|              1|       2012|
|2012|              2|       2012|
|2012|              4|       2012|
|2015|              3|       2012|
|2019|              4|       2012|
|2021|              5|       2012|
+----+---------------+-----------+
and I have a list of years List(1, 3, 8), which means x years after the system_year.
The goal is to calculate the number of total sold products for each year after system_year.
In other words, I have to calculate the total sold products for year 2013, 2015, 2020.
The output dataset should be like this :
+-------+-----------------------+
| year | total_product_sold |
+-------+-----------------------+
| 1 | 6 | -> 2012 - 2013 6 products sold
| 3 | 9 | -> 2012 - 2015 9 products sold
| 8 | 13 | -> 2012 - 2020 13 products sold
+-------+-----------------------+
I want to know how to do this in Scala. Should I use groupBy() in this case?
You could have used a groupBy with case/when if the year ranges didn't overlap. But here you'll need to do a groupBy for each year and then union the 3 grouped dataframes:
val years = List(1, 3, 8)

val result = years.map { y =>
  df.filter($"year".between($"system_year", $"system_year" + y))
    .groupBy(lit(y).as("year"))
    .agg(sum($"nb_product_sold").as("total_product_sold"))
}.reduce(_ union _)
result.show
//+----+------------------+
//|year|total_product_sold|
//+----+------------------+
//|   1|                 6|
//|   3|                 9|
//|   8|                13|
//+----+------------------+
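If scanning the data once per offset is a concern, the same totals can be computed in a single pass by cross-joining the offsets and grouping once. A sketch using the same df and column names (the offset column is a hypothetical helper name):
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession value named spark

val offsets = List(1, 3, 8).toDF("offset")

val singlePass = df
  .crossJoin(offsets)
  .filter($"year".between($"system_year", $"system_year" + $"offset"))
  .groupBy($"offset")
  .agg(sum($"nb_product_sold").as("total_product_sold"))
  .withColumnRenamed("offset", "year")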
There might be multiple ways of doing this, some more efficient than what I am showing you, but this works for your use case.
//Sample Data
val df = Seq(
  (2010, 1, 2012), (2012, 2, 2012), (2012, 4, 2012),
  (2015, 3, 2012), (2019, 4, 2012), (2021, 5, 2012)
).toDF("year", "nb_product_sold", "system_year")

//taking the difference of the years from the system year
val df1 = df.withColumn("Difference", $"year" - $"system_year")

//getting the running total for all years present in the dataframe by partitioning
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("year").orderBy("year")
val df2 = df1.withColumn("runningsum", sum("nb_product_sold").over(w))
  .withColumn("yearlist", lit(0))
  .dropDuplicates("year", "system_year", "Difference")

//creating the years list
val years = List(1, 3, 8)

//creating a dataframe with the total count for each year, unioning them all and removing duplicates
var df3 = spark.createDataFrame(sc.emptyRDD[Row], df2.schema)
for (year <- years) {
  val innerdf = df2.filter($"Difference" >= year - 1 && $"Difference" <= year)
    .withColumn("yearlist", lit(year))
  df3 = df3.union(innerdf)
}

//again partitioning by system year and summing the running totals as per the requirement
val w1 = Window.partitionBy("system_year").orderBy("year")
val finaldf = df3.withColumn("total_product_sold", sum("runningsum").over(w1))
  .select("yearlist", "total_product_sold")
The final select gives yearlist and total_product_sold, matching the expected totals above (1 -> 6, 3 -> 9, 8 -> 13).

How to get the latest date from listed dates along with the total count?

I have the DataFrame below; it has keys with different dates, and I would like to display the latest date together with the count for each id-key pair.
Input data as below:
id key date
11 222 1/22/2017
11 222 1/22/2015
11 222 1/22/2016
11 223 9/22/2017
11 223 1/22/2010
11 223 1/22/2008
Code I have tried:
val counts = df.groupBy($"id",$"key").count()
I am getting the output below:
id key count
11 222 3
11 223 3
However, I would like the output to be as below:
id key count maxDate
11 222 3 1/22/2017
11 223 3 9/22/2017
One way would be to transform the date into unix time, do the aggregation and then convert it back again. These conversions to and from unix time can be performed with unix_timestamp and from_unixtime, respectively. When the date is in unix time, the latest date can be selected by finding the maximum value. The only possible downside of this approach is that the date format must be given explicitly.
val dateFormat = "MM/dd/yyyy"
val df2 = df.withColumn("date", unix_timestamp($"date", dateFormat))
.groupBy($"id",$"key").agg(count("date").as("count"), max("date").as("maxDate"))
.withColumn("maxDate", from_unixtime($"maxDate", dateFormat))
Which will give you:
+---+---+-----+----------+
| id|key|count|   maxDate|
+---+---+-----+----------+
| 11|222|    3|01/22/2017|
| 11|223|    3|09/22/2017|
+---+---+-----+----------+
Perform an agg on both fields
df.groupBy($"id", $"key").agg(count($"date"), max($"date"))
Output:
+---+---+-----------+-----------+
| id|key|count(date)|  max(date)|
+---+---+-----------+-----------+
| 11|222|          3|  1/22/2017|
| 11|223|          3|  9/22/2017|
+---+---+-----------+-----------+
Edit: The as option proposed in the other answer is pretty good too.
Edit: The comment below is right. You need to convert to a proper date format first. You can check the other answer, which converts to a timestamp, or use a UDF:
import java.text.SimpleDateFormat
import org.apache.spark.sql.{SparkSession, functions}

val simpleDateFormatOriginal: SimpleDateFormat = new SimpleDateFormat("MM/dd/yyyy")
val simpleDateFormatDestination: SimpleDateFormat = new SimpleDateFormat("yyyy/MM/dd")

val toyyyymmdd = (s: String) => {
  simpleDateFormatDestination.format(simpleDateFormatOriginal.parse(s))
}

val toddmmyyyy = (s: String) => {
  simpleDateFormatOriginal.format(simpleDateFormatDestination.parse(s))
}

val toyyyymmddudf = functions.udf(toyyyymmdd)
val toddmmyyyyyudf = functions.udf(toddmmyyyy)

df.withColumn("date", toyyyymmddudf($"date"))
  .groupBy($"id", $"key")
  .agg(count($"date"), max($"date").as("maxDate"))
  .withColumn("maxDate", toddmmyyyyyudf($"maxDate"))

How to transpose dataframe in Spark 1.5 (no pivot operator available)?

I want to transpose the following table using Spark Scala, without the pivot function.
I am using Spark 1.5.1 and the pivot function is not supported in 1.5.1. Please suggest a suitable method to transpose the following table:
Customer Day Sales
1 Mon 12
1 Tue 10
1 Thu 15
1 Fri 2
2 Sun 10
2 Wed 5
2 Thu 4
2 Fri 3
Output table :
Customer Sun Mon Tue Wed Thu Fri
1 0 12 10 0 15 2
2 10 0 0 5 4 3
The following code does not work because I am using Spark 1.5.1 and the pivot function is only available from Spark 1.6:
var Trans = Cust_Sales.groupBy("Customer").Pivot("Day").sum("Sales")
Not sure how efficient this is, but you can use collect to get all the distinct days, add a column for each of them, and then use groupBy and sum:
// get distinct days from data (this assumes there are not too many of them):
val days: Array[String] = df.select("Day")
  .distinct()
  .collect()
  .map(_.getAs[String]("Day"))

// add column for each day with the Sale value if days match:
val withDayColumns = days.foldLeft(df) {
  case (data, day) => data.selectExpr("*", s"IF(Day = '$day', Sales, 0) AS $day")
}

// wrap it up
val result = withDayColumns
  .drop("Day")
  .drop("Sales")
  .groupBy("Customer")
  .sum(days: _*)

result.show()
Which prints (almost) what you wanted:
+--------+--------+--------+--------+--------+--------+--------+
|Customer|sum(Tue)|sum(Thu)|sum(Sun)|sum(Fri)|sum(Mon)|sum(Wed)|
+--------+--------+--------+--------+--------+--------+--------+
|       1|      10|      15|       0|       2|      12|       0|
|       2|       0|       4|      10|       3|       0|       5|
+--------+--------+--------+--------+--------+--------+--------+
I'll leave it to you to rename / reorder the columns if needed.
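If you do want that step, here is a small sketch that builds on the days and result values above (the weekday-order list is an assumption; adjust as needed):
import org.apache.spark.sql.functions.col

// strip the sum(...) wrapper and put the day columns into calendar order
val dayOrder = Seq("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat").filter(days.contains)
val renamed = dayOrder.foldLeft(result) {
  case (d, day) => d.withColumnRenamed(s"sum($day)", day)
}
renamed.select((col("Customer") +: dayOrder.map(col)): _*).show()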
If you are working with Python, the code below might help. Let's say you want to transpose a Spark DataFrame df:
pandas_df = df.toPandas().transpose().reset_index()
transposed_df = sqlContext.createDataFrame(pandas_df)
transposed_df.show()
Consider a data frame which has 6 columns; we want to group by the first 4 columns and pivot on col5 while aggregating on col6 (say, a sum on it).
If you could use Spark 1.6, this would simply be:
val pivotedDf = df_to_pivot
  .groupBy(col1, col2, col3, col4)
  .pivot(col5)
  .agg(sum(col6))
Here is code with the same output but without using the built-in pivot function (so it also works in Spark 1.5):
import scala.collection.SortedMap
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DateType, DoubleType, StringType, StructField, StructType}

//Extracting the col5 distinct values to create the new columns
val distinctCol5Values = df_to_pivot
  .select(col(col5))
  .distinct
  .sort(col5) // ensure that the output columns are in a consistent logical order
  .map(_.getString(0))
  .collect()
  .toSeq

//Grouping the data frame to be pivoted by col1-col4
val pivotedAgg = df_to_pivot.rdd
  .groupBy { row =>
    (row.getString(col1Index),
      row.getDate(col2Index),
      row.getDate(col3Index),
      row.getString(col4Index))
  }

//Initializing a list of tuples of (String, double value) to be filled into the columns that will be created
val pivotColListTuple = distinctCol5Values.map(ft => (ft, 0.0))

//Using a SortedMap to ensure the column order is maintained
var distinctCol5ValuesListMap = SortedMap(pivotColListTuple: _*)

//Pivoting the data on col5 by opening up the grouped data
val pivotedRDD = pivotedAgg.map { groupedRow =>
  distinctCol5ValuesListMap = distinctCol5ValuesListMap.map(ft => (ft._1, 0.0))
  groupedRow._2.foreach { row =>
    //Updating the distinctCol5ValuesListMap values to reflect the changes
    //Change this part according to what you want to aggregate
    distinctCol5ValuesListMap = distinctCol5ValuesListMap.updated(row.getString(col5Index),
      distinctCol5ValuesListMap.getOrElse(row.getString(col5Index), 0.0) + row.getDouble(col6Index))
  }
  Row.fromSeq(Seq(groupedRow._1._1, groupedRow._1._2, groupedRow._1._3, groupedRow._1._4) ++
    distinctCol5ValuesListMap.values.toSeq)
}

//Constructing the StructFields for the new columns
val colTypesStruct = distinctCol5ValuesListMap.map(colName => StructField(colName._1, DoubleType))

//Adding the first four column StructFields before the new columns' StructFields
val opStructType = StructType(Seq(StructField(col1Name, StringType),
  StructField(col2Name, DateType),
  StructField(col3Name, DateType),
  StructField(col4Name, StringType)) ++ colTypesStruct)

//Creating the final data frame
val pivotedDF = sqlContext.createDataFrame(pivotedRDD, opStructType)
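The snippet above leaves df_to_pivot and the col*/name/index values unbound; a purely hypothetical set of bindings (not from the original answer, shown only to make the free names concrete) would look like this:
// Hypothetical bindings for the free names used above; adjust to your own schema.
val (col1, col2, col3, col4, col5, col6) = ("id", "start_date", "end_date", "category", "day", "amount")
val (col1Name, col2Name, col3Name, col4Name) = (col1, col2, col3, col4)
val (col1Index, col2Index, col3Index, col4Index, col5Index, col6Index) = (0, 1, 2, 3, 4, 5)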

Pandas: how to merge two dataframes on offset dates?

I'd like to merge two dataframes, df1 & df2, based on whether rows of df2 fall within a 3-6 month date range after rows of df1. For example:
df1 (for each company I have quarterly data):
company DATADATE
0 012345 2005-06-30
1 012345 2005-09-30
2 012345 2005-12-31
3 012345 2006-03-31
4 123456 2005-01-31
5 123456 2005-03-31
6 123456 2005-06-30
7 123456 2005-09-30
df2 (for each company I have event dates that can happen on any day):
company EventDate
0 012345 2005-07-28 <-- won't get merged b/c not within date range
1 012345 2005-10-12
2 123456 2005-05-15
3 123456 2005-05-17
4 123456 2005-05-25
5 123456 2005-05-30
6 123456 2005-08-08
7 123456 2005-11-29
8 abcxyz 2005-12-31 <-- won't be merged because company not in df1
Ideal merged df -- rows with EventDates in df2 that are 3-6 months (i.e. 1 quarter) after DATADATEs in rows of df1 will be merged:
company DATADATE EventDate
0 012345 2005-06-30 2005-10-12
1 012345 2005-09-30 NaN <-- nan because no EventDates fell in this range
2 012345 2005-12-31 NaN
3 012345 2006-03-31 NaN
4 123456 2005-01-31 2005-05-15
5 123456 2005-01-31 2005-05-17
5 123456 2005-01-31 2005-05-25
5 123456 2005-01-31 2005-05-30
6 123456 2005-03-31 2005-08-08
7 123456 2005-06-30 2005-11-29
8 123456 2005-09-30 NaN
I am trying to apply this related topic [ Merge pandas DataFrames based on irregular time intervals ] by adding start_time and end_time columns to df1 denoting 3 months (start_time) to 6 months (end_time) after DATADATE, then using np.searchsorted(), but this case is a bit trickier because I'd like to merge on a company-by-company basis.
This is actually one of those rare questions where the algorithmic complexity might be significantly different for different solutions. You might want to consider this over the niftiness of 1-liner snippets.
Algorithmically:
sort the larger of the dataframes according to the date
for each date in the smaller dataframe, use the bisect module to find the relevant rows in the larger dataframe
For dataframes with lengths m and n, respectively (m < n) the complexity should be O(m log(n)).
This is my solution, going off of the algorithm that Ami Tavory suggested above:
import bisect
import pandas as pd

#find the date offsets to define date ranges
start_time = df1.DATADATE.apply(pd.offsets.MonthEnd(3))
end_time = df1.DATADATE.apply(pd.offsets.MonthEnd(6))

#make these extra columns
df1['start_time'] = start_time
df1['end_time'] = end_time

#find unique company names in both dfs
unique_companies_df1 = df1.company.unique()
unique_companies_df2 = df2.company.unique()

#sort df1 by company and DATADATE, so we can iterate in a sensible order
sorted_df1 = df1.sort(['company', 'DATADATE']).reset_index(drop=True)

#define empty df to append data to
df3 = pd.DataFrame()

#iterate through each company in df1, find that company in sorted df2,
#then for each DATADATE quarter of df1, bisect df2 in the correct
#locations (i.e. start_time to end_time)
for cmpny in unique_companies_df1:
    #if this company is in both dfs, take the relevant rows that are associated with it
    if cmpny in unique_companies_df2:
        selected_df2 = df2[df2.company == cmpny].sort('EventDate').reset_index(drop=True)
        selected_df1 = sorted_df1[sorted_df1.company == cmpny].reset_index(drop=True)
        #iterate through each DATADATE quarter in df1
        for quarter in xrange(len(selected_df1.DATADATE)):
            #bisect_right to ensure that we do not include dates before our date range
            lo = bisect.bisect_right(selected_df2.EventDate, selected_df1.start_time[quarter])
            #bisect_left here to not include dates after our desired date range
            hi = bisect.bisect_left(selected_df2.EventDate, selected_df1.end_time[quarter])
            #grab all rows with EventDates that fall within our date range
            df_right = selected_df2.loc[lo:hi].copy()
            df_left = pd.DataFrame(selected_df1.loc[quarter]).transpose()
            #if no EventDates fall within range, create a row with cmpny in the 'company'
            #column and a NaT in the EventDate column so the merge still produces a row
            if len(df_right) == 0:
                df_right.loc[0, 'company'] = cmpny
            #merge the df1 company quarter with all df2 rows that fell within the date range
            temp = pd.merge(df_left, df_right, how='inner', on='company')
            df3 = df3.append(temp)