PySpark - Create equivalent of business current view in PySpark

I need to create an equivalent of a business current view in PySpark. I have a history file and a delta file (each containing id and date). I need to create a final dataframe which will have a single record for each id, and that record should be the one with the latest date.
df1=sql_context.createDataFrame([("3000", "2017-04-19"), ("5000", "2017-04-19"), ("9012", "2017-04-19")], ["id", "date"])
df2=sql_context.createDataFrame([("3000", "2017-04-18"), ("5120", "2017-04-18"), ("1012", "2017-04-18")], ["id", "date"])
df3=df2.union(df1).distinct()
+----+----------+
| id| date|
+----+----------+
|3000|2017-04-19|
|3000|2017-04-18|
|5120|2017-04-18|
|5000|2017-04-19|
|1012|2017-04-18|
|9012|2017-04-19|
+----+----------+
I tried doing a union and then a distinct; it gives me id=3000 for both dates, whereas I need only the record for id=3000 with date=2017-04-19.
Even subtract doesn't work, since it returns all the rows of either of the dataframes.
Desired output:-
+----+----------+
| id| date|
+----+----------+
|3000|2017-04-19|
|5120|2017-04-18|
|5000|2017-04-19|
|1012|2017-04-18|
|9012|2017-04-19|
+----+----------+

Hope this helps!
from pyspark.sql.functions import unix_timestamp, col, to_date, max
#sample data
df1=sqlContext.createDataFrame([("3000", "2017-04-19"),
                                ("5000", "2017-04-19"),
                                ("9012", "2017-04-19")],
                               ["id", "date"])
df2=sqlContext.createDataFrame([("3000", "2017-04-18"),
                                ("5120", "2017-04-18"),
                                ("1012", "2017-04-18")],
                               ["id", "date"])
df=df2.union(df1)
df.show()
#convert 'date' column to date type so that latest date can be fetched for an ID
df = df.\
    withColumn('date_inDateFormat', to_date(unix_timestamp(col('date'), "yyyy-MM-dd").cast("timestamp"))).\
    drop('date')
#get latest date for an ID
df = df.groupBy('id').agg(max('date_inDateFormat').alias('date'))
df.show()
Output is:
+----+----------+
| id| date|
+----+----------+
|5000|2017-04-19|
|1012|2017-04-18|
|5120|2017-04-18|
|9012|2017-04-19|
|3000|2017-04-19|
+----+----------+
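If the real history and delta files carry more columns than just id and date, a window-function variant keeps the entire latest row per id instead of only the aggregated date. This is a minimal sketch, not part of the original answer, reusing df1/df2 from above and assuming Spark 2.2+ (for to_date with a format argument):
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number, to_date

# rank each id's rows by date (newest first) and keep only the top row per id
w = Window.partitionBy("id").orderBy(to_date(col("date"), "yyyy-MM-dd").desc())
latest = (df2.union(df1)
          .withColumn("rn", row_number().over(w))
          .filter(col("rn") == 1)
          .drop("rn"))
latest.show()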
Note: Please don't forget to let SO know if the answer helps you solve your problem.

Related

I need to create a new dataframe as below in PySpark from a given input dataset.

Persons who have the same salary should come in the same record, and their names should be separated by ",".
Input dataset / expected dataset: attached as images in the original post.
You can achieve this as below -
Apply a groupBy on Salary and use - collect_list to club all the Name inside an ArrayType()
Further you can choose to convert it to a StringType using - concat_ws
Data Preparation
import pandas as pd
from io import StringIO
import pyspark.sql.functions as F

df = pd.read_csv(StringIO("""Name,Salary
abc,100000
bcd,20000
def,100000
pqr,20000
xyz,30000
"""),
                 delimiter=','
                 ).applymap(lambda x: str(x).strip())
sparkDF = sql.createDataFrame(df)  # 'sql' is the existing SparkSession/SQLContext
sparkDF.groupby("Salary").agg(F.collect_list(F.col("Name")).alias('Name')).show(truncate=False)
+------+----------+
|Salary|Name |
+------+----------+
|100000|[abc, def]|
|20000 |[bcd, pqr]|
|30000 |[xyz] |
+------+----------+
Concat WS
sparkDF.groupby("Salary").agg(F.concat_ws(",",F.collect_list(F.col("Name"))).alias('Name')).show(truncate=False)
+------+-------+
|Salary|Name |
+------+-------+
|100000|abc,def|
|20000 |bcd,pqr|
|30000 |xyz |
+------+-------+
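For a fully self-contained run without pandas, the same pipeline can be built straight from a SparkSession. A sketch, assuming an existing SparkSession named spark (the original answer calls its session sql):
from pyspark.sql import functions as F

sparkDF = spark.createDataFrame(
    [("abc", "100000"), ("bcd", "20000"), ("def", "100000"),
     ("pqr", "20000"), ("xyz", "30000")],
    ["Name", "Salary"])

# group the names that share a salary and join them with ","
(sparkDF.groupby("Salary")
        .agg(F.concat_ws(",", F.collect_list(F.col("Name"))).alias("Name"))
        .show(truncate=False))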

How to get 1st day of the year in pyspark

I have a date variable that I need to pass to various functions.
For example, if I have the date 12/09/2021 in a variable, it should return 01/01/2021.
How do I get the 1st day of the year in PySpark?
You can use the trunc function, which truncates parts of a date.
from pyspark.sql import functions as f

df = spark.createDataFrame([()], [])
(
    df
    .withColumn('current_date', f.current_date())
    .withColumn("year_start", f.trunc("current_date", "year"))
    .show()
)
# Output
+------------+----------+
|current_date|year_start|
+------------+----------+
| 2022-02-23|2022-01-01|
+------------+----------+
x = '12/09/2021'
'01/01/' + x[-4:]
output: '01/01/2021'
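If you want to stay in Spark rather than slice strings, the same trunc idea works on the literal once it is parsed. A sketch, assuming the variable holds a MM/dd/yyyy string (swap the pattern if it is dd/MM/yyyy) and an existing SparkSession named spark:
import pyspark.sql.functions as f

x = '12/09/2021'
(spark.range(1)
      .select(f.date_format(f.trunc(f.to_date(f.lit(x), 'MM/dd/yyyy'), 'year'),
                            'MM/dd/yyyy').alias('year_start'))
      .show())
# year_start -> 01/01/2021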
You can achieve this with date_trunc combined with to_date, since date_trunc on its own returns a Timestamp rather than a Date.
Data Preparation
import pandas as pd
import pyspark.sql.functions as F

df = pd.DataFrame({
    'Date': ['2021-01-23', '2002-02-09', '2009-09-19'],
})
sparkDF = sql.createDataFrame(df)  # 'sql' is the existing SparkSession/SQLContext
sparkDF.show()
+----------+
| Date|
+----------+
|2021-01-23|
|2002-02-09|
|2009-09-19|
+----------+
Date Trunc & To Date
sparkDF = sparkDF.withColumn('first_day_year_dt', F.to_date(F.date_trunc('year', F.col('Date')), 'yyyy-MM-dd'))\
                 .withColumn('first_day_year_timestamp', F.date_trunc('year', F.col('Date')))
sparkDF.show()
+----------+-----------------+------------------------+
| Date|first_day_year_dt|first_day_year_timestamp|
+----------+-----------------+------------------------+
|2021-01-23| 2021-01-01| 2021-01-01 00:00:00|
|2002-02-09| 2002-01-01| 2002-01-01 00:00:00|
|2009-09-19| 2009-01-01| 2009-01-01 00:00:00|
+----------+-----------------+------------------------+

How to get year and week number aligned for a date

While trying to get the year and week number of a range of dates spanning multiple years, I am running into some issues at the start/end of the year.
I understand the logic for the week number and for the year when they run separately. However, when they are combined, in some cases they don't produce consistent results, and I was wondering what the best way is in Spark to make sure those scenarios are handled with a consistent year for the given week number.
For example, running:
spark.sql("select year('2017-01-01') as year, weekofyear('2017-01-01') as weeknumber").show(false)
outputs:
+----+----------+
|year|weeknumber|
+----+----------+
|2017|52 |
+----+----------+
But the wanted output would be:
+----+----------+
|year|weeknumber|
+----+----------+
|2016|52 |
+----+----------+
and running:
spark.sql("select year('2018-12-31') as year, weekofyear('2018-12-31') as weeknumber").show(false)
produces:
+----+----------+
|year|weeknumber|
+----+----------+
|2018|1 |
+----+----------+
But what is expected is:
+----+----------+
|year|weeknumber|
+----+----------+
|2019|1 |
+----+----------+
Code is running on Spark 2.4.2.
This Spark behavior is consistent with the ISO 8601 definition. You cannot change it. However, there is a workaround I can think of.
You can first determine the day of week: if it is less than 4, increase the year by one; if it equals 4, keep the year untouched; otherwise decrease the year by one.
Example with 2017-01-01
sql("select case when date_format('2017-01-01', 'u') < 4 then year('2017-01-01')+1 when date_format('2017-01-01', 'u') = 4 then year('2017-01-01') else year('2017-01-01')- 1 end as year, weekofyear('2017-01-01') as weeknumber, date_format('2017-01-01', 'u') as dayOfWeek").show(false)
+----+----------+---------+
|year|weeknumber|dayOfWeek|
+----+----------+---------+
|2016|52 |7 |
+----+----------+---------+
Example with 2018-12-31
sql("select case when date_format('2018-12-31', 'u') < 4 then year('2018-12-31')+1 when date_format('2018-12-31', 'u') = 4 then year('2018-12-31') else year('2018-12-31')- 1 end as year, weekofyear('2018-12-31') as weeknumber, date_format('2018-12-31', 'u') as dayOfWeek").show(false)
+----+----------+---------+
|year|weeknumber|dayOfWeek|
+----+----------+---------+
|2019|1 |1 |
+----+----------+---------+
val df = Seq(("2017-01-01"), ("2018-12-31")).toDF("dateval")
df.show()
+----------+
| dateval|
+----------+
|2017-01-01|
|2018-12-31|
+----------+
df.createOrReplaceTempView("date_tab")
val newDF = spark.sql("""select dateval,
case when weekofyear(dateval)=1 and month(dateval)=12 then struct((year(dateval)+1) as yr, weekofyear(dateval) as wk)
when weekofyear(dateval)=52 and month(dateval)=1 then struct((year(dateval)-1) as yr, weekofyear(dateval) as wk)
else struct((year(dateval)) as yr, weekofyear(dateval) as wk) end as week_struct
from date_tab""");
newDF.select($"dateval", $"week_struct.yr", $"week_struct.wk").show()
+----------+----+---+
| dateval| yr| wk|
+----------+----+---+
|2017-01-01|2016| 52|
|2018-12-31|2019| 1|
+----------+----+---+
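A rough PySpark translation of the same boundary check, shown here only as a sketch (it is not part of the original Scala answer and assumes an existing SparkSession named spark):
from pyspark.sql import functions as F

df = spark.createDataFrame([("2017-01-01",), ("2018-12-31",)], ["dateval"])

wk = F.weekofyear("dateval")
yr = F.year("dateval")
mo = F.month("dateval")

# shift the calendar year when the ISO week spills over a year boundary
# (>= 52 also covers ISO week 53, a slight extension of the original =52 check)
adjusted_year = (F.when((wk == 1) & (mo == 12), yr + 1)
                 .when((wk >= 52) & (mo == 1), yr - 1)
                 .otherwise(yr))

df.select("dateval", adjusted_year.alias("yr"), wk.alias("wk")).show()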
You can also use a UDF to achieve this
import org.apache.spark.sql.types._
import java.time.temporal.IsoFields

def weekYear(date: java.sql.Date): Option[Int] = {
  if (date == null) None
  else Some(date.toLocalDate.get(IsoFields.WEEK_BASED_YEAR))
}
Register this udf as
spark.udf.register("yearOfWeek", weekYear _)
Result:-
scala> spark.sql("select yearOfWeek('2017-01-01') as year, WEEKOFYEAR('2017-01-01') as weeknumber").show(false)
+----+----------+
|year|weeknumber|
+----+----------+
|2016|52 |
+----+----------+
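For PySpark, a rough analogue of the same idea is sketched below (not from the original answer); Python's date.isocalendar() already exposes the ISO week-based year, and spark is assumed to be an existing SparkSession:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

def week_year(d):
    # ISO week-based year of a date; None for null dates
    return None if d is None else d.isocalendar()[0]

week_year_udf = F.udf(week_year, IntegerType())

df = spark.createDataFrame([("2017-01-01",), ("2018-12-31",)], ["dateval"])
(df.withColumn("year", week_year_udf(F.to_date("dateval")))
   .withColumn("weeknumber", F.weekofyear("dateval"))
   .show())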

Converting string time to day timestamp

I have just started working with PySpark and need some help converting a column's datatype.
My dataframe has a string column, which stores the time of day in AM/PM, and I need to convert this into datetime for further processing/analysis.
fd = spark.createDataFrame([(['0143A'])], ['dt'])
fd.show()
+-----+
| dt|
+-----+
|0143A|
+-----+
from pyspark.sql.functions import date_format, to_timestamp
#fd.select(date_format('dt','hhmma')).show()
fd.select(to_timestamp('dt','hhmmaa')).show()
+----------------------------+
|to_timestamp(`dt`, 'hhmmaa')|
+----------------------------+
| null|
+----------------------------+
Expected output: 01:43
How can I get the proper datetime format in the above scenario?
Thanks for your help!
If we look at the doc for to_timestamp (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.to_timestamp), we see that the format must be specified as a SimpleDateFormat (https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html).
In order to retrieve the time of day in AM/PM, we must use hhmma. But in SimpleDateFormat, a matches AM or PM, not A or P. So we need to change our string:
import pyspark.sql.functions as F
df = spark.createDataFrame([(['0143A'])], ['dt'])
df2 = df.withColumn('dt', F.concat(F.col('dt'), F.lit('M')))
df3 = df2.withColumn('ts', F.to_timestamp('dt','hhmma'))
df3.show()
+------+-------------------+
| dt| ts|
+------+-------------------+
|0143AM|1970-01-01 01:43:00|
+------+-------------------+
If you want to retrieve it as a string in the format you mentioned, you can use date_format:
df4 = df3.withColumn('time', F.date_format(F.col('ts'), format='HH:mm'))
df4.show()
+------+-------------------+-----+
| dt| ts| time|
+------+-------------------+-----+
|0143AM|1970-01-01 01:43:00|01:43|
+------+-------------------+-----+
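The same steps can also be collapsed into a single expression; this is just a compact sketch combining the answer's concat, to_timestamp and date_format calls:
import pyspark.sql.functions as F

fd = spark.createDataFrame([('0143A',)], ['dt'])
fd.select(
    F.date_format(
        F.to_timestamp(F.concat(F.col('dt'), F.lit('M')), 'hhmma'),
        'HH:mm'
    ).alias('time')
).show()
# time -> 01:43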

How to retrieve the month from a date column values in scala dataframe?

Given:
val df = Seq((1L, "04-04-2015")).toDF("id", "date")
val df2 = df.withColumn("month", from_unixtime(unix_timestamp($"date", "dd/MM/yy"), "MMMMM"))
df2.show()
I got this output:
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015| null|
+---+----------+-----+
However, I want the output to be as below:
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015|April|
+---+----------+-----+
How can I do that in sparkSQL using Scala?
This should do it:
val df2 = df.withColumn("month", date_format(to_date($"date", "dd-MM-yyyy"), "MMMM"))
df2.show
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015|April|
+---+----------+-----+
NOTE:
The first string (to_date) must match the format of your existing date
Be careful with: "dd-MM-yyyy" vs "MM-dd-yyyy"
The second string (date_format) is the format of the output
Docs:
to_date
date_format
Nothing is wrong in your code; just keep your date format consistent with your date column.
A screenshot comparing the original and changed code was attached in the original answer.
Happy Hadooping!
Not exactly related to this question, but for whoever wants to get the month as an integer, there is a month function (it takes a single date column, so parse the string first):
val df2 = df.withColumn("month", month(to_date($"date", "dd-MM-yyyy")))
df2.show
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015| 4|
+---+----------+-----+
In the same way, you can use the year function to get only the year.