Convert Unix Timestamp Subtraction to Hours/Minutes in PySpark

I am performing a LAG operation between two unix timestamps, which is working perfectly fine. However, I would like to convert the resulting column to hours/minutes.
df_new = df_new.withColumn("diff", F.when(F.isnull(df_new.calculated_time - df_new.prev_value), 0)
                           .otherwise(df_new.calculated_time.cast('long') - df_new.prev_value.cast('long')))
Output:
+--------------------+--------------------+---------------+-------------+---------+
| primary_key| status|calculated_time| prev_value| diff|
+--------------------+--------------------+---------------+-------------+---------+
|{"approval_id": "...|Pending review by...| 1562315397258| null| 0|
|{"approval_id": "...| Denied| 1562936139570|1562315397258|620742312|
|{"approval_id": "...|Request clarifica...| 1563172343614|1562936139570|236204044|
|{"approval_id": "...| null| 1563172473488|1563172343614| 129874|
|{"approval_id": "...| Approved| 1563190166533|1563172473488| 17693045|
+--------------------+--------------------+---------------+-------------+---------+
In the 'diff' column I would like to have minutes instead. Can anyone please help?
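A likely approach (an assumption on my part, since the question leaves the time unit implicit): the calculated_time values are 13 digits, which look like epoch milliseconds, so dividing the millisecond difference by 60,000 yields minutes (3,600,000 for hours). In PySpark that would be something like df_new.withColumn('diff_minutes', F.round(F.col('diff') / 60000, 2)). The arithmetic, checked in plain Python against the first non-null row above:

```python
# The calculated_time / prev_value values are 13 digits, so they look
# like epoch *milliseconds*. Dividing a millisecond difference by
# 60_000 gives minutes; by 3_600_000, hours.
diff_ms = 1562936139570 - 1562315397258  # the 620742312 row above
minutes = diff_ms / 60_000
hours = diff_ms / 3_600_000
print(round(minutes, 2), round(hours, 2))  # → 10345.71 172.43
```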

Related

How to filter data based on the month and year value

I am trying to filter data based on the month and year value in a date column.
I converted my date column from string to date as
df.withColumn('ifrs_year_dt', to_date(unix_timestamp('ifrs_year_dr', 'MM/dd/yyyy').cast('timestamp')))
df = df.withColumn('month', month(df['ifrs_year_dt']))
I am getting the error int object is not callable when using the month() function. I tried it inside filter and it says the same:
df = df.filter(month(df['ifrs_year_dt']) == 3)
And I am still getting the same error.
The error 'int' object is not callable usually means the name month has been rebound to an integer somewhere in your session (e.g. month = 3); importing the functions module as a namespace and calling F.month avoids that collision. Here is a minimal working example that I think you can adapt to your needs:
import pyspark.sql.functions as F
sample_dates = ['09/01/2021',
                '10/01/2021',
                '03/01/2021',
                '07/10/2010']
df = spark.createDataFrame([(date,) for date in sample_dates], ["ifrs_year_dr"])
df_with_date = df.withColumn('ifrs_year_dt', F.to_date(F.unix_timestamp('ifrs_year_dr', 'MM/dd/yyyy').cast('timestamp')))
df_with_month=df_with_date.withColumn('month',F.month(df_with_date['ifrs_year_dt']))
df_with_month.show()
df_with_month.filter(F.col("month") == 3).show()
output:
+------------+------------+-----+
|ifrs_year_dr|ifrs_year_dt|month|
+------------+------------+-----+
| 09/01/2021| 2021-09-01| 9|
| 10/01/2021| 2021-10-01| 10|
| 03/01/2021| 2021-03-01| 3|
| 07/10/2010| 2010-07-10| 7|
+------------+------------+-----+
+------------+------------+-----+
|ifrs_year_dr|ifrs_year_dt|month|
+------------+------------+-----+
| 03/01/2021| 2021-03-01| 3|
+------------+------------+-----+
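As a side note (my addition, not from the thread): Spark's 'MM/dd/yyyy' pattern corresponds to strptime's '%m/%d/%Y', which makes it easy to sanity-check the parse outside Spark:

```python
from datetime import datetime

# Check the date format outside Spark: 'MM/dd/yyyy' in Spark maps to
# '%m/%d/%Y' in Python's strptime.
parsed = datetime.strptime('03/01/2021', '%m/%d/%Y')
print(parsed.date(), parsed.month)  # → 2021-03-01 3
```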

PySpark - Struggling to arrange the data by a specific format

I am working on outputting total deduplicated counts from a pre-aggregated frame.
I currently have a data frame that displays like so; it is the initial structure, and the point I have gotten to by filtering out unneeded columns.
+---+------+
| ID|Source|
+---+------+
|101| Grape|
|101|Flower|
|102|   Bee|
|103| Peach|
|105|Flower|
+---+------+
We can see from the example above that 101 is found in both Grape and Flower. I would like to rearrange the format so that the distinct string values from the "Source" column become their own columns, as from there I can perform a groupBy for a specific arrangement of yes's and no's, as so.
+---+-----+------+---+-----+
| ID|Grape|Flower|Bee|Peach|
+---+-----+------+---+-----+
|101|  Yes|   Yes| No|   No|
|102|   No|    No|Yes|   No|
|103|   No|    No| No|  Yes|
+---+-----+------+---+-----+
I agree that creating this manually via the above example is a good fit, but I am working with 100M+ rows and need something more succinct.
What I've extracted so far is a list of distinct Source values and arranged them into a list:
dedupeTableColumnNames = dedupeTable.select('SOURCE').distinct().collect()
dedupeTableColumnNamesCleaned = re.findall(r"'([^']*)'", str(dedupeTableColumnNames))
That's just a pivot:
df.groupBy("id").pivot("source").count().show()
+---+----+------+-----+-----+
| id| Bee|Flower|Grape|Peach|
+---+----+------+-----+-----+
|103|null|  null| null|    1|
|105|null|     1| null| null|
|101|null|     1|    1| null|
|102|   1|  null| null| null|
+---+----+------+-----+-----+
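To get the Yes/No layout the asker wants, the pivoted counts just need a null check per column; in PySpark that could be F.when(F.col(c).isNotNull(), 'Yes').otherwise('No') for each pivoted column c (my suggestion, not part of the answer above). The same logic, sketched in plain Python on the sample rows:

```python
# Pure-Python sketch of the same pivot-then-flag logic on the sample rows.
rows = [(101, "Grape"), (101, "Flower"), (102, "Bee"), (103, "Peach"), (105, "Flower")]
sources = sorted({s for _, s in rows})

# Collect the set of sources seen per ID (the "pivot").
seen = {}
for id_, src in rows:
    seen.setdefault(id_, set()).add(src)

# Turn presence/absence into Yes/No per source column.
table = {id_: {s: "Yes" if s in srcs else "No" for s in sources}
         for id_, srcs in seen.items()}
print(table[101])  # → {'Bee': 'No', 'Flower': 'Yes', 'Grape': 'Yes', 'Peach': 'No'}
```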

Scala Spark functions like group by, describe() returning incorrect result

I am using Scala Spark in the IntelliJ IDE to analyze a CSV file having 672,112 records. The file is available at this link: https://www.kaggle.com/kiva/data-science-for-good-kiva-crowdfunding
File name : kiva_loans.csv
I ran the show() command to view a few records, and it reads all columns correctly, but when I apply a group by on the column "repayment_interval", it displays values that appear to be data from other columns (a column shift), as shown below.
distinct values in the "repayment_interval" columns are
Monthly (More frequent)
irregular
bullet
weekly (less frequent)
For testing purposes, I searched for the values given in the screenshot, put those rows in a separate file, and tried to read that file using Scala Spark. It shows all values in the correct columns, and even groupBy returns correct values.
I am also facing this issue with the describe() function.
As shown in the image above, the "id" and "funded_amount" columns are numeric, but I am not sure why describe() on them gives string values for "min" and "max".
The read CSV command is as below:
val kivaloans=spark.read
//.option("sep",",")
.format("com.databricks.spark.csv")
.option("header",true)
.option("inferschema","true")
.csv("kiva_loans.csv")
printSchema output after adding .option("multiline","true"): it is reading a few rows as headers, as shown in the highlighted yellow color.
It seems there are newline characters in the column data. Hence, set the multiline property to true:
val kivaloans=spark.read.format("com.databricks.spark.csv")
.option("multiline","true")
.option("header",true)
.option("inferschema","true")
.csv("kiva_loans.csv")
Data summary is as follows after setting multiline as true:
+-------+------------------+-----------------+-----------------+----------+-----------+--------------------+--------------------+------------------+------------+------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------------+--------------------+--------------------+--------------------+--------------------+
|summary| id| funded_amount| loan_amount| activity| sector| use| country_code| country| region| currency| partner_id| posted_time| disbursed_time| funded_time| term_in_months| lender_count| tags| borrower_genders| repayment_interval| date|
+-------+------------------+-----------------+-----------------+----------+-----------+--------------------+--------------------+------------------+------------+------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------------+--------------------+--------------------+--------------------+--------------------+
| count| 671205| 671205| 671205| 671205| 671205| 666977| 671197| 671205| 614441| 671186| 657699| 671195| 668808| 622890| 671196| 671199| 499834| 666957| 671191| 671199|
| mean| 993248.5937336581|785.9950611214159|842.3971066961659| null| null| 10000.0| null| null| null| null| 178.20274555550654| 162.01020408163265| 179.12244897959184| 189.3|13.74266332047713|20.588457578299735| 25.68553459119497| 26.4| 26.210526315789473| 27.304347826086957|
| stddev|196611.27542282813|1130.398941057504|1198.660072882945| null| null| NaN| null| null| null| null| 94.24892231613454| 78.65564973356628| 100.70555939905975| 125.87299363372507|8.631922222356161|28.458485403188924| 31.131029407317044| 35.87289875191111| 52.43279244938066| 41.99181173710449|
| min| 653047| 0.0| 25.0|Adult Care|AgricultuTo buy chicken.| ""fajas"" [wove...| 10 boxes of cream| 3x1 purlins| T-shaped brackets| among other prod...| among other item...| and pay for labour"| and cassava to m...| yeast| rice| milk| among other prod...|#Animals, #Biz Du...| #Elderly|
| 25%| 823364| 250.0| 275.0| null| null| 10000.0| null| null| null| null| 126.0| 123.0| 105.0| 87.0| 8.0| 7.0| 8.0| 8.0| 9.0| 6.0|
| 50%| 992996| 450.0| 500.0| null| null| 10000.0| null| null| null| null| 145.0| 144.0| 144.0| 137.0| 13.0| 13.0| 14.0| 15.0| 14.0| 17.0|
| 75%| 1163938| 900.0| 1000.0| null| null| 10000.0| null| null| null| null| 204.0| 177.0| 239.0| 201.0| 14.0| 24.0| 27.0| 31.0| 24.0| 34.0|
| max| 1340339| 100000.0| 100000.0| Wholesale| Wholesale|? provide a safer...| ZW| Zimbabwe| ?ZM?T| baguida| XOF| XOF| Yoro, Yoro| USD| USD| USD|volunteer_pick, v...|volunteer_pick, v...| weekly|volunteer_pick, v...|
+-------+------------------+-----------------+-----------------+----------+-----------+--------------------+--------------------+------------------+------------+------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------------+--------------------+--------------------+--------------------+--------------------+
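The column-shift symptom is what naive line splitting does to quoted fields that contain embedded newlines; multiline=true makes Spark's CSV reader handle them. A small plain-Python illustration of the difference (with made-up data, not rows from kiva_loans.csv):

```python
import csv
import io

# One record has a quoted field with an embedded newline. A real CSV
# parser keeps it as one record; naive line splitting breaks it in two,
# shifting everything after the break into the wrong columns.
raw = 'id,use\n1,"buy seed\nand fertilizer"\n2,plough\n'

parsed = list(csv.reader(io.StringIO(raw)))  # header + 2 records
naive = raw.strip().split("\n")              # header + 3 "records"

print(len(parsed) - 1, "parsed records")  # → 2 parsed records
print(len(naive) - 1, "naive records")    # → 3 naive records
```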

WeekOfYear column getting null in Spark SQL

Here I am writing a SQL statement for spark.sql, but I am not able to get WEEKOFYEAR to convert to the week of the year; I am getting null in the output.
Below I have shown the expression I am using.
Input Data
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,01-12-2010 8.26,2.55,17850,United Kingdom
536365,71053,WHITE METAL LANTERN,6,01-12-2010 8.26,3.39,17850,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,01-12-2010 8.26,2.75,17850,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,01-12-2010 8.26,3.39,17850,United Kingdom
SQL CODE
val summarySQlTest = spark.sql(
"""
|select Country,WEEKOFYEAR(InvoiceDate)as WeekNumber,
|count(distinct(InvoiceNo)) as NumInvoices,
|sum(Quantity) as TotalQuantity,
|round(sum(Quantity*UnitPrice),2) as InvoiceValue
|from sales
|group by Country,WeekNumber
|""".stripMargin
).show()
DESIRED OUTPUT
+--------------+----------+-----------+-------------+------------+
| Country|WeekNumber|NumInvoices|TotalQuantity|InvoiceValue|
+--------------+----------+-----------+-------------+------------+
| Spain| 49| 1| 67| 174.72|
| Germany| 48| 11| 1795| 3309.75|
Output I am getting
+--------------+----------+-----------+-------------+------------+
| Country|WeekNumber|NumInvoices|TotalQuantity|InvoiceValue|
+--------------+----------+-----------+-------------+------------+
| Spain| null| 1| 67| 174.72|
| Germany| null| 11| 1795| 3309.75|
For the desired output I used this, but I want to solve the same in spark.sql.
Also, it would be great if anyone could explain what is actually happening here:
to_date(col("InvoiceDate"), "dd-MM-yyyy H.mm")
val knowFunc= invoicesDF.withColumn("InvoiceDate",to_date(col("InvoiceDate"),"dd-MM-yyyy H.mm"))
.where("year(InvoiceDate) == 2010")
.withColumn("WeekNumber",weekofyear(col("InvoiceDate")))
.groupBy("Country","WeekNumber")
.agg(sum("Quantity").as("TotalQuantity"),
round(sum(expr("Quantity*UnitPrice")),2).as("InvoiceValue")).show()
You'll need to convert the InvoiceDate column to date type first (using to_date) before you can call weekofyear; when WEEKOFYEAR is handed a string like 01-12-2010 8.26 that cannot be implicitly cast to a date, it returns null. I guess this also answers your last question.
val summarySQlTest = spark.sql(
"""
|select Country,WEEKOFYEAR(to_date(InvoiceDate,'dd-MM-yyyy H.mm')) as WeekNumber,
|count(distinct(InvoiceNo)) as NumInvoices,
|sum(Quantity) as TotalQuantity,
|round(sum(Quantity*UnitPrice),2) as InvoiceValue
|from sales
|group by Country,WeekNumber
|""".stripMargin
).show()
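For intuition (my addition): WEEKOFYEAR follows ISO-8601 week numbering, the same scheme as Python's isocalendar(), so the expected week numbers can be sanity-checked outside Spark:

```python
from datetime import datetime

# Parse one InvoiceDate from the sample and take its ISO week number;
# Spark's WEEKOFYEAR uses the same ISO-8601 numbering.
d = datetime.strptime("01-12-2010 8.26", "%d-%m-%Y %H.%M")
print(d.date(), d.isocalendar()[1])  # → 2010-12-01 48
```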

Closest Date looking from One Column to another in PySpark Dataframe

I have a pyspark dataframe where the price of a commodity is mentioned, but there is no data for when the commodity was bought; I just have a range of 1 year.
+---------+------------+----------------+----------------+
|Commodity| BuyingPrice|Date_Upper_limit|Date_lower_limit|
+---------+------------+----------------+----------------+
| Apple| 5| 2020-07-04| 2019-07-03|
| Banana| 3| 2020-07-03| 2019-07-02|
| Banana| 4| 2019-10-02| 2018-10-01|
| Apple| 6| 2020-01-20| 2019-01-19|
| Banana| 3.5| 2019-08-17| 2018-08-16|
+---------+------------+----------------+----------------+
I have another pyspark dataframe where I can see the market price and date of all commodities.
+----------+----------+------------+
| Date| Commodity|Market Price|
+----------+----------+------------+
|2020-07-01| Apple| 3|
|2020-07-01| Banana| 3|
|2020-07-02| Apple| 4|
|2020-07-02| Banana| 2.5|
|2020-07-03| Apple| 7|
|2020-07-03| Banana| 4|
+----------+----------+------------+
I want to see the closest date to the upper limit of the date range where the Market Price (MP) of that commodity is less than or equal to the Buying Price (BP).
Expected output (for the top 2 rows):
+---------+------------+----------------+----------------+--------------------------------+
|Commodity| BuyingPrice|Date_Upper_limit|Date_lower_limit|Closest Date to UL when MP <= BP|
+---------+------------+----------------+----------------+--------------------------------+
| Apple| 5| 2020-07-04| 2019-07-03| 2020-07-02|
| Banana| 3| 2020-07-03| 2019-07-02| 2020-07-02|
+---------+------------+----------------+----------------+--------------------------------+
Even though Apple was much lower on 2020-07-01 ($3), 2020-07-02 was the first date going backwards from the Upper Limit (UL) where MP <= BP, so I selected 2020-07-02.
How can I look backwards like this to fill in the probable buying date?
Try this with a conditional join and a window function:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w=Window().partitionBy("Commodity")
# the first dataframe shown is df1, the second is df2
df1.join(df2.withColumnRenamed("Commodity", "Commodity1"),
         F.expr("""`Market Price`<=BuyingPrice and Date<Date_Upper_limit and Commodity==Commodity1"""))\
   .drop("Market Price", "Commodity1")\
   .withColumn("max", F.max("Date").over(w))\
   .filter('max==Date').drop("max")\
   .withColumnRenamed("Date", "Closest Date to UL when MP <= BP")\
   .show()
#+---------+-----------+----------------+----------------+--------------------------------+
#|Commodity|BuyingPrice|Date_Upper_limit|Date_lower_limit|Closest Date to UL when MP <= BP|
#+---------+-----------+----------------+----------------+--------------------------------+
#| Banana| 3.0| 2020-07-03| 2019-07-02| 2020-07-02|
#| Apple| 5.0| 2020-07-04| 2019-07-03| 2020-07-02|
#+---------+-----------+----------------+----------------+--------------------------------+
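The join-then-max logic above can be sketched in plain Python on the sample data (my illustration, with the market rows hard-coded as a dict): for each commodity, keep the market dates where price <= buying price and date < upper limit, then take the latest one.

```python
from datetime import date

# Market data from the second dataframe: (commodity, date) -> market price.
market = {
    ("Apple", date(2020, 7, 1)): 3.0, ("Banana", date(2020, 7, 1)): 3.0,
    ("Apple", date(2020, 7, 2)): 4.0, ("Banana", date(2020, 7, 2)): 2.5,
    ("Apple", date(2020, 7, 3)): 7.0, ("Banana", date(2020, 7, 3)): 4.0,
}

def closest_date(commodity, buying_price, upper_limit):
    """Latest market date before upper_limit where market price <= buying price."""
    candidates = [d for (c, d), p in market.items()
                  if c == commodity and p <= buying_price and d < upper_limit]
    return max(candidates) if candidates else None

print(closest_date("Apple", 5, date(2020, 7, 4)))   # → 2020-07-02
print(closest_date("Banana", 3, date(2020, 7, 3)))  # → 2020-07-02
```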