How to get the most recent value in a Spark dataframe - Scala

I have a Spark data frame with the following structure:
id flag price date
a 0 100 2015
a 0 50 2015
a 1 200 2014
a 1 300 2013
a 0 400 2012
I need to create a data frame where the most recent flag 1 value is filled into the flag 0 rows:
id flag price date new_column
a 0 100 2015 200
a 0 50 2015 200
a 1 200 2014 null
a 1 300 2013 null
a 0 400 2012 null
There are flag=0 rows at the top and at the bottom. Consider the first rows (flag=0, 2015): there are two flag=1 values available (200 and 300), and I take the most recent one, 200 (2014). For the last row (flag=0, 2012) there is no earlier flag=1 value, so it is updated with null.
Looking for a solution using Scala. Any help would be appreciated. Thanks.

You can try to use window functions.
Basically, create a window where you partition by id and order by date. Then take the previous row for each row. Lastly, use when/otherwise to set the new column to null for all rows with flag 1.
Something like this:
val df = sc.parallelize(List(("a",0,100,2015),("a",1,200,2014),("a",1,300,2013),("a",0,400,2012))).toDF("id","flag","price","date")

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, when}

// Window over each id, ordered by date
val wSpec1 = Window.partitionBy("id").orderBy("date")

// For flag = 0 rows, take the price of the previous row in the window; otherwise null
val df2 = df.withColumn("last1", when(df("flag") === 0, lag(df("price"), 1).over(wSpec1)).otherwise(null))
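For a quick sanity check of the snippet above (a sketch only; the row order printed by show() after the window shuffle is not guaranteed):

df2.show()
// Expected last1 values, row by row (order may differ):
//   (a, 0, 100, 2015) -> 200   (the previous row by date is the 2014 flag=1 row)
//   (a, 1, 200, 2014) -> null
//   (a, 1, 300, 2013) -> null
//   (a, 0, 400, 2012) -> null  (no earlier row in the partition)

Note that lag(..., 1) only looks one row back, so with several flag=0 rows in a row (the original data also has a 50/2015 row) one of them would pick up another flag=0 row's price instead of the flag=1 price; a window that looks further back would be needed in that case.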

Related

Epoch time difference when dates vary not showing correct value

SELECT
r.order_id
,c.order_time
,r.pickup_time
,EXTRACT(epoch FROM (r.pickup_time - c.order_time) / 60) AS diff
FROM runner_orders1 r
JOIN customer_orders1 c ON c.order_id = r.order_id
WHERE distance != 0;
order_id | order_time          | pickup_time         | diff
1        | 2020-01-01 18:05:02 | 2020-01-01 18:15:34 | 10.533333
2        | 2020-01-01 19:00:52 | 2020-01-01 19:10:54 | 10.033333
3        | 2020-01-02 23:51:23 | 2020-01-03 00:12:37 | 21.233333
3        | 2020-01-02 23:51:23 | 2020-01-03 00:12:37 | 21.233333
4        | 2020-01-04 13:23:46 | 2020-01-04 13:53:03 | 29.283333
4        | 2020-01-04 13:23:46 | 2020-01-04 13:53:03 | 29.283333
Here is my SQL output above. As you can see, when the timestamp days differ, the value is not correct. Please check and help. For example:
3 | 2020-01-02 23:51:23 | 2020-01-03 00:12:37 | 21.233333
Your query is returning exactly the correct results (at least to 6 decimal digits). The problem seems to stem from your expectations. Apparently, you are looking for diff to be minutes and seconds; however, this is not what you are getting. By extracting epoch and dividing by 60, your results are in minutes and a fractional part of a minute. In the selected row, the difference in time between pickup_time and order_time is 21:14 (min:sec); the result returned turns out to be 21:13.99998 (min:sec.fractional seconds). This is just another reason why using epoch is not really a good practice (see Epoch Mania for others).
Correct the result by just subtracting the dates (you are already doing so). This gives the result as an interval in hours:minutes:seconds. For your example it returns 00:21:14 (see demo):
select r.order_id
, c.order_time
, r.pickup_time
, r.pickup_time - c.order_time as diff
from runner_orders1 r
join customer_orders1 c on c.order_id = r.order_id
where distance != 0;

Get count of the value repeated in the last 24 hours in pyspark dataframe

Please help me with this pyspark code. I need to count the number of times an ip appeared in the last 24 hours excluding that instance. The first time an ip appears in the data the count_last24hrs column should return value as 0. From next time onwards, the code should count the number of times the same ip has appeared in the last 24 hours from that timestamp excluding that instance.
I was trying to use window function but was not getting the desired result.
count_last24hrs is the column name in which the result should appear.
Using this data frame as df
column names as (datetime, ip, count_last24hrs)
(10/05/2022 10:14:00 AM, 1.1.1.1, 0)
(10/05/2022 10:16:00 AM, 1.1.1.1, 1)
(10/05/2022 10:18:00 AM, 2.2.2.2, 0)
(10/05/2022 10:21:00 AM, 1.1.1.1, 2)
Code that I was trying:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# function to calculate the number of seconds from the number of days
days = lambda i: i * 86400

# create window by casting timestamp to long (number of seconds)
w = Window.orderBy(F.col("datetime").cast('long')).rangeBetween(-days(1), 0)

# use collect_set and size functions to perform countDistinct over a window
df_new = df.withColumn('count_last24hrs', F.size(F.collect_set("ip").over(w)))
from pyspark.sql import functions as F

result = (df
    .withColumn('ip_count', F.expr("count(ip_address) over (partition by ip_address order by datetimecol range between interval 24 hours preceding and current row)"))
    .withColumn('ip_count', F.when(F.col('ip_count') == 0, 0).otherwise(F.col('ip_count') - 1))
    .select('datetimecol', 'ip_address', 'ip_count'))
The first withColumn statement partitions the data by "ip_address", orders it by time, and counts the rows that fall within the preceding 24 hours (including the current row).
The second withColumn decrements that count by 1, so that the first occurrence gets 0 instead of 1.
Result:
datetimecol         | ip      | ip_last24_hrs
2022-05-10 10:14:00 | 1.1.1.1 | 0
2022-05-10 10:16:00 | 1.1.1.1 | 1
2022-05-10 10:18:00 | 2.2.2.2 | 0
2022-05-10 10:21:00 | 1.1.1.1 | 2
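For reference, the same rolling 24-hour count can also be sketched with the Spark Scala DataFrame API (an illustration only, assuming the input columns are named ip and datetime as in the question and that datetime is already a proper timestamp):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit, unix_timestamp}

// Range window per IP covering the preceding 24 hours (86400 seconds), current row included
val w = Window
  .partitionBy("ip")
  .orderBy(unix_timestamp(col("datetime")))
  .rangeBetween(-86400, 0)

// Count the rows in the window and subtract 1 to exclude the current occurrence itself
val result = df.withColumn("count_last24hrs", count(lit(1)).over(w) - 1)

This mirrors the answer above: count the rows in the 24-hour range frame, then subtract one so that the first occurrence of an IP reports 0.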

How to return null if an error for try_parse

I am trying to simply return NULL if I get an error from TRY_PARSE. I am using TRY_PARSE to parse out a datetime from a string.
create table #example (
ID int identity(1, 1)
, extractedDateTime varchar(50)
)
insert into #example (extractedDateTime)
values ('7/19/21 11:15')
,('/30/21 1100')
,('05/15/2021 17:00')
,('05/03/2021 0930')
,('5/26/21 09:30')
,('05/26/2021 0930')
,('06/09/2021 12:00')
,('07/06/2021 13:00')
,('6/15/21 12:00')
,('07/09/2021 07:30')
,('07/14/2021 13:20')
,('/19/2021 10:30')
,('7/22/21 1030')
,('7/21/201')
,('06/21/21 11:00')
select exm.ID, exm.extractedDateTime, [TRY_PARSED] = TRY_PARSE(exm.extractedDateTime as datetime2 using 'en-US')
from #example as exm
drop table #example
In the above example there is ID 14: '7/21/201', which will be parsed as being from the year 201 (presumably it was meant to be 21 or 2021). I have gotten this to parse as datetime2; originally I was using datetime. I am inclined to still use datetime, but what I would like is to return NULL for that particular row. Instead I get a lengthy error message about a SqlDateTime overflow, because of using datetime of course.
The reason I want to go back to using datetime is that this is an incorrect value, and using datetime might help filter out erroneous values like this. Also, I'd like the query to return NULL anyway whenever this little bit encounters an error, so that it doesn't stop the entire query (this is part of a much larger query).
How can I return NULL for this record? Any advice would be greatly appreciated!
As others have noted, the TRY_PARSE is functioning as expected and documented.
That doesn't help your case, though, so I'd suggest setting an arbitrary minimum date such that any date prior to it gets assigned NULL.
Sample code:
SELECT exm.ID,
exm.extractedDateTime,
[TRY_PARSED] = CASE
WHEN TRY_PARSE(exm.extractedDateTime AS DATETIME2 USING 'en-US') < '1970-01-01'
THEN NULL
ELSE TRY_PARSE(exm.extractedDateTime AS DATETIME2 USING 'en-US')
END
FROM #example AS exm;
Results:
ID | extractedDateTime | TRY_PARSED
1  | 7/19/21 11:15     | 2021-07-19 11:15:00.0000000
2  | /30/21 1100       | NULL
3  | 05/15/2021 17:00  | 2021-05-15 17:00:00.0000000
4  | 05/03/2021 0930   | NULL
5  | 5/26/21 09:30     | 2021-05-26 09:30:00.0000000
6  | 05/26/2021 0930   | NULL
7  | 06/09/2021 12:00  | 2021-06-09 12:00:00.0000000
8  | 07/06/2021 13:00  | 2021-07-06 13:00:00.0000000
9  | 6/15/21 12:00     | 2021-06-15 12:00:00.0000000
10 | 07/09/2021 07:30  | 2021-07-09 07:30:00.0000000
11 | 07/14/2021 13:20  | 2021-07-14 13:20:00.0000000
12 | /19/2021 10:30    | NULL
13 | 7/22/21 1030      | NULL
14 | 7/21/201          | NULL
15 | 06/21/21 11:00    | 2021-06-21 11:00:00.0000000
The advice is to not do anything, since what you are asking for is the default behavior of TRY_PARSE. Check the documentation:
https://learn.microsoft.com/en-us/sql/t-sql/functions/try-parse-transact-sql?view=sql-server-ver15

Bin dates without aggregation in Pandas

I have a date column in a Pandas.DataFrame:
date Value
2014-02-27 0
2014-08-15 1
2015-04-11 1
2014-09-01 2
I need a function to create a new column that identifies what quarter, month, etc. a record belongs to as shown below:
get_date_bucket(date_var='date', frequency='Q')
date Value date_bucket
2014-02-27 0 2014-03-31
2014-08-15 1 2014-09-30
2015-04-11 1 2015-06-30
2014-09-01 2 2014-09-30
or...
get_date_bucket(date_var='date', frequency='M')
date Value date_bucket
2014-02-27 0 2014-02-29
2014-08-15 1 2014-08-31
2015-04-11 1 2015-04-30
2014-09-01 2 2014-09-30
The data is reasonably large and I don't want to do any aggregation if I can avoid it. What is the simplest way to create the 'date_bucket' column from the 'date' column on the left?
The easiest way is to use a pandas offset:
df['date_bucket'] = df.date + pd.offsets.QuarterEnd()
df
To generalize to any frequency specified by a string
from pandas.tseries.frequencies import to_offset
df.date + to_offset('Q')

MDX iif date greater than doesn't evaluate correctly in my cube

I have a problem which seems very simple to solve, but I can't solve it. In my fact table I have a Timestamp field which is a smalldatetime type. The fact is linked to a Time dimension via its FullDate_FK (also smalldatetime). What I would like is to compare the Timestamp with the FullDate_FK from the fact to create a calculation like this:
iif([Dim Time].[Date].CurrentMember.MemberValue <=
[Fact].[Timestamp].CurrentMember.MemberValue
,[measures].[YTD Actuals]
,[measures].[YTD Actuals]+[measures].[YTD Com])
But it is not working at all. All [Dim Time].[Date] members seem to be evaluated as less than the Timestamp.
P.S.: The Timestamp is the last date on which the data was loaded into the DB (in my case 31/08).
Here is the result I got:
MONTH | YTD Actuals | YTD Com | Calculation
JAN   | 10          | 10      | 10
FEB   | 20          | 10      | 20
MAR   | 40          | 20      | 40
MAY   | 60          | 30      | 60
JUN   | 70          | 50      | 70
JUL   | 85          | 50      | 85
AUG   | 120         | 55      | 120
SEP   | 120         | 60      | 120
OCT   | 120         | 70      | 120
NOV   | 120         | 80      | 120
DEC   | 120         | 90      | 120
From August on, I should have the sum of YTD Actuals and YTD Com in the Calculation, but I still get YTD Actuals only.
Extra Info
I'm using a PivotTable in Excel, just dragging attributes: Month on rows and the measures (the two YTD measures and the new calculated member).
If you build a new calc which is:
[Fact].[Timestamp].CurrentMember.MemberValue
What does it return when you add it to your PivotTable? Null? I suspect the CurrentMember is the All member so MemberValue is null. But let's test that.
Do all rows in the fact table have the same Timestamp or are there many different timestamps?
If your fact table has 10,000 rows, are you expecting the IIf calc to be evaluated 10,000 times (once for each row)? That's not the way MDX works. In your PivotTable, which has 12 rows, the IIf calc gets evaluated 12 times, at the month grain.
If you want the calculation to happen on each of the 10000 rows then write the calculation in SQL and do it in a SQL view before it gets to the cube.
To make the calc work as you intend in the cube consider doing the following. Add a new column in your DimTime SQL table called Today Flag. It should be updated during the ETL to be Y only on the row which is today and should be N on other rows. Then add that column as a new attribute to your Dim Time dimension. You can make it Visible=False. Then go to the Calculations tab and flip to the Script view and replace your current [Measures].[Calculation] calc with this:
Create Member CurrentCube.[Measures].[Calculation] as
[measures].[YTD Actuals];
Scope({Exists([Dim Time].[Month].[Month].Members,[Dim Time].[Today Flag].&[Y]).Item(0).Item(0):null});
[Measures].[Calculation] = [measures].[YTD Actuals]+[measures].[YTD Com];
End Scope;