Spark time series data generation - Scala

I am trying to generate time series data in Spark with Scala. I have the following hourly data in a DataFrame:
sid | date                      | count
200 | 2016-04-30T18:00:00+00:00 | 10
200 | 2016-04-30T21:00:00+00:00 | 5
I want to generate hourly time series data for the last 2 days, taking the max time from the input. In my case the series should be anchored at 2016-04-30T21:00:00+00:00 and cover the preceding 48 hours, one row per hour. Any hour without data should have a null count. Sample output as follows:
id | sid | date                      | count
1  | 200 | 2016-04-28T22:00:00+00:00 |
2  | 200 | 2016-04-28T23:00:00+00:00 |
3  | 200 | 2016-04-29T00:00:00+00:00 |
...
45 | 200 | 2016-04-30T18:00:00+00:00 | 10
...
48 | 200 | 2016-04-30T21:00:00+00:00 | 5
Thanks,
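A minimal sketch of one possible way to do this, assuming Spark 2.1+ and that the DataFrame above is called `input` with its date column already parsed as a timestamp (the names and version are assumptions, not part of the question): take the max timestamp, build the 48 preceding hourly slots, and left-join the original counts so that hours without data keep a null count.

import java.sql.Timestamp
import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession value named `spark`

// latest timestamp in the input
val maxTs: Timestamp = input.agg(max($"date")).head().getTimestamp(0)

// the 48 hourly slots ending at the max timestamp, oldest first, with a running id
val slots = (47 to 0 by -1).zipWithIndex
  .map { case (hoursBack, i) => (i + 1, new Timestamp(maxTs.getTime - hoursBack * 3600000L)) }
  .toDF("id", "date")

// pair every sid with every slot, then left-join the real counts;
// slots with no matching input row keep a null count
val result = slots
  .crossJoin(input.select($"sid").distinct())
  .join(input, Seq("sid", "date"), "left")
  .select($"id", $"sid", $"date", $"count")
  .orderBy($"sid", $"id")

result.show(48, truncate = false)

Only 48 slot rows are generated, so building them on the driver and cross-joining against the distinct sids stays cheap even when the input table is large.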

Related

How to calculate standard deviation over a range of dates when there are dates missing in pyspark 2.2.0

I have a pyspark df in which I am using a combination of window + udf functions to calculate the standard deviation over historical business dates. The challenge is that my df is missing dates on which there is no transaction. How do I calculate the std dev including these missing dates, without adding them as additional rows to my df (which would inflate the df and risk running out of memory)?
Sample Table & Current output
| ID | Date | Amount | Std_Dev|
|----|----------|--------|--------|
|1 |2021-03-24| 10000 | |
|1 |2021-03-26| 5000 | |
|1 |2021-03-29| 10000 |2886.751|
Current Code
import numpy as np
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import IntegerType

windowSpec = Window.partitionBy("ID").orderBy("Date")

# UDF to calculate the difference between two dates in business days
workdaysUDF = F.udf(lambda date1, date2: int(np.busday_count(date2, date1)) if (date1 is not None and date2 is not None) else None, IntegerType())

# business-day difference of each row from the first date in its ID partition
df = df.withColumn("date_dif", workdaysUDF(F.col("Date"), F.first(F.col("Date")).over(windowSpec)))

# rolling window over the previous `days` business days, inclusive of the current row
windowval = lambda days: Window.partitionBy("ID").orderBy("date_dif").rangeBetween(-days, 0)
df = df.withColumn("std_dev", F.stddev("Amount").over(windowval(6))).drop("date_dif")
Desired output, where the amounts for the dates missing between 24 and 29 March are treated as 0:
| ID | Date | Amount | Std_Dev|
|----|----------|--------|--------|
|1 |2021-03-24| 10000 | |
|1 |2021-03-26| 5000 | |
|1 |2021-03-29| 10000 |4915.96 |
Please note that I am only showing the std dev for a single date for illustration; there would be a value for each row, since I am using a rolling window function.
Any help would be greatly appreciated.
PS: The PySpark version at my enterprise is 2.2.0, so I do not have the flexibility to change it.
Thanks,
VSG

How to calculate the AVG time stamp for last one week

I am trying to calculate the AVG timestamp for the last 7 days in a Snowflake database.
The data type is VARCHAR; below is sample data.
LOAD_TIME VARCHAR(10)
Sample Data:
LOAD_TIME (HHMM)
1017
0927
0713
0645
1753
2104
1253
If you convert these values to epoch_seconds, it's possible to calculate the average:
select to_varchar(to_timestamp(avg(date_part(epoch_second,to_timestamp(load_time,'HH24MI')))), 'HH24MI') as average
from values
('1017'),('0927'),('0713'),('0645'),('1753'),('2104'),('1253') tmp (load_time);
+---------+
| AVERAGE |
+---------+
| 1213 |
+---------+

Stored procedure (or better way) to add a new row to existing table every day at 22:00

I will be very grateful for your advice regarding the following issue.
Given:
PostgreSQL database
Initial (basic) query
select day, Value_1, Value_2, Value_3
from table
where day=current_date
which returns a row with the following columns:
Day        | Value_1 (int) | Value_2 (int) | Value_3 (int)
2019-11-14 | 10            | 10            | 14
I need to create a view with this starting information and add a new row to it every day, based on the outcome of the initial query executed at 22:00.
The expected outcome tomorrow at 22:01 will be
Day | Value_1 | Value_2 | Value_3
2019-11-14 | 10 | 10 | 14
2019-11-15 | N | M | P
Many thanks in advance for your time and support.

Calculate depart flights from sorted data using Spark

I have a dataset of flights in the form of:
+----------------+----------+-------------+
|flightID |depart_ts |arrival_ts |
+----------------+----------+-------------+
|1 |1451603468| 1451603468|
|2 |1451603468| 1451603468|
|3 |1451603468| 1451603468|
|4 |1451603468| 1451603468|
|5 |1451603468| 1451603468|
+----------------+----------+-------------+
and my job is to use Apache Spark to find the return flight for each flight, given some conditions (the departure time of return flight B should be within 2 hours of the arrival time of flight A). Doing a cross join of 1M records to check these conditions is not efficient and will take a long time. I've thought about using a window function with a single partition and a custom UDAF to do the calculation. Something like this:
1. val flightsWindow = Window.orderBy("depart_ts").rangeBetween(0, 7200)
2. flights.withColumn("returnFlightID", calcReturn( $"arrival_ts", $"depart_ts").over(flightsWindow)).show()
Considering that this approach will lead to a solution, I'm facing some challenges:
In line 1, I want to let the frame range span from CURRENT ROW to arrival_ts + 7200, but apparently I cannot define dynamic ranges in Spark, can I?
In line 1, assuming that two flights have the same arrival time, it becomes impossible to retrieve the values of the second flight once the CURRENT ROW pointer moves there, since the difference between the first and second flight is 0. Is it possible to explicitly tell the range to start framing from CURRENT ROW?
In line 2, I want to retrieve the depart_ts value of the very first row of the frame to compare against the other flights in the frame. Is it possible to do that? I tried the first() function, but it does not fit my case.
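Separately from the questions above, a possible alternative sketch that avoids both the single-partition window and the full cross join, assuming Spark 2.x and that the DataFrame is named `flights` (both are assumptions): bucket the timestamps into 2-hour buckets so that candidate matching becomes an equi-join plus a cheap filter. Any airport/route matching columns would be added to the join keys as well; they are omitted here because they are not in the sample data.

import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession value named `spark`

val bucketSeconds = 7200L

// each flight keyed by the 2-hour bucket of its arrival
val arrivals = flights
  .select($"flightID".as("a_id"), $"arrival_ts")
  .withColumn("bucket", floor($"arrival_ts" / bucketSeconds))

// each flight keyed by the 2-hour bucket of its departure
val departures = flights
  .select($"flightID".as("b_id"), $"depart_ts")
  .withColumn("bucket", floor($"depart_ts" / bucketSeconds))

// a departure within 2 hours of an arrival lies in the same bucket or the next one,
// so emit each departure under both keys, equi-join on the bucket,
// and apply the exact 0..7200-second condition as a filter afterwards
val candidates = arrivals
  .join(departures.union(departures.withColumn("bucket", $"bucket" - 1)), Seq("bucket"))
  .filter($"depart_ts".between($"arrival_ts", $"arrival_ts" + bucketSeconds))
  .filter($"a_id" =!= $"b_id")

candidates.select($"a_id".as("flightID"), $"b_id".as("returnFlightID")).show()

Each flight is then only compared against flights departing in its own and the neighbouring 2-hour bucket rather than against all 1M rows, which keeps the work distributed instead of forcing everything into one partition.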

Cognos Calculate Variance Crosstab (Dimensional)

This is very similar to Cognos Calculate Variance Crosstab (Relational), but my data source is dimensional.
I have a simple crosstab such as this:
| 04-13-2013 | 04-13-2014
---------------------------------------
Sold | 75 | 50
Purchased | 10 | 15
Repaired | 33 | 44
Filter: The user selects 1 date and then we include that date plus 1 year ago.
Dimension: The date is the day level in a YQMD Hierarchy.
Measures: We are showing various measures from a Measure Dimension.
Sold
Purchased
Repaired
Here is what it looks like in Report Studio:
| <#Day#> | <#Day#>
---------------------------------------
<#Sold#> | <#1234#> | <#1234#>
<#Purchased#> | <#1234#> | <#1234#>
<#Repaired#> | <#1234#> | <#1234#>
I want to be able to calculate the variance as a percentage between the two time periods for each measure, like this:
| 04-13-2013 | 04-13-2014 | Var. %
-----------------------------------------------
Sold | 75 | 50 | -33%
Purchased | 10 | 15 | 50%
Repaired | 33 | 44 | 33%
I added a Query Expression to the right of the <#Day#> as shown below, but I cannot get the variance calculation to work.
| <#Day#> | <#Variance#>
---------------------------------------
<#Sold#> | <#1234#> | <#1234#>
<#Purchased#> | <#1234#> | <#1234#>
<#Repaired#> | <#1234#> | <#1234#>
These are the expressions I've tried and the results that I get:
A hard-coded expression works, but only for that one measure:
total(case when [date] = 2014-04-13 then [Sold] end)
/
total(case when [date] = 2013-04-13 then [Sold] end)
-1
I thought CurrentMember and PrevMember might work, but it produces blank cells:
CurrentMember( [YQMD Hierarchy] )
/
prevMember(CurrentMember([YQMD Hierarchy]))
-1
I think it is because prevMember produces blank.
prevMember(CurrentMember([YQMD Hierarchy]))
Using only CurrentMember gives a total of both columns:
CurrentMember([YQMD Hierarchy])
What expression can I use to take advantage of my dimensional model and add a column with % variance?
These are the pages I used for research:
Variance reporting in Report Studio on Cognos 8.4?
Calculations that span dimensions - PDF
IBM Cognos 10 Report Studio: Creating Consumer-Friendly Reports
I hope there is a better way to do this.

I finally found a resource that describes one approach to this problem. Using the tail and head functions, we can get to the first and last periods, and thereby calculate the % variance.
item(tail(members([Day])),0)
/
item(head(members([Day])),0)
-1
This idea came from IBM Cognos BI – Using Dimensional Functions to Determine Current Period.
Example 2 – Find Current Period by Filtering on Measure Data
If the OLAP or DMR data source has been populated with time periods into the future (e.g. end of year or future years), then the calculation of current period is more complicated. However, it can still be determined by finding the latest period that has data for a given measure.
item(tail(filter(members([sales_and_marketing].[Time].[Time].[Month]),
tuple([Revenue], currentMember([sales_and_marketing].[Time].[Time]))
is not null), 1), 0)