Feature engineering of rolling windows with Apache Beam

I have been able to read in the following data representing customer transactions as csv with Beam (Python SDK).
timestamp,customer_id,amount
2018-02-08 12:04:36.899422,1,45.92615814813004
2019-04-05 07:40:17.873746,1,47.360044568200514
2019-07-27 04:37:48.060949,1,23.325754816230106
2017-05-18 15:46:41.654809,2,25.47369262400646
2018-08-08 03:59:05.791552,2,34.859367944028875
2019-01-02 02:44:35.208450,2,5.2753275435507705
2020-03-06 09:45:29.866731,2,35.656304542140404
2020-05-28 20:19:08.593375,2,23.23715711587539
The csv is being read in as follows:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.textio import ReadFromText
import datetime
class Split(beam.DoFn):
    def process(self, element):
        timestamp, customer_id, amount = element.split(",")
        return [{
            'timestamp': timestamp,
            'customer': int(customer_id),
            'amount': float(amount)
        }]

options = PipelineOptions()
with beam.Pipeline(options=options) as p:
    rows = (
        p |
        ReadFromText('../data/sample_trxns.csv', skip_header_lines=1) |
        beam.ParDo(Split())
    )
class UnixTime(beam.DoFn):
    def process(self, element):
        """
        Returns a one-element list holding the row with its timestamp
        converted to Unix time (seconds since the epoch)
        """
        unix_time = datetime.datetime.strptime(
            element['timestamp'],
            "%Y-%m-%d %H:%M:%S.%f"
        ).timestamp()
        return [{
            'timestamp': unix_time,
            'customer': element['customer'],
            'amount': element['amount']
        }]

class AddTimestampDoFn(beam.DoFn):
    def process(self, element):
        unix_timestamp = element['timestamp']
        # Wrap and emit the current entry and new timestamp in a
        # TimestampedValue.
        yield beam.window.TimestampedValue(element, unix_timestamp)

timed_rows = (
    rows |
    beam.ParDo(UnixTime()) |
    beam.ParDo(AddTimestampDoFn())
)
However, with Beam I have been unable to derive rolling window features such as 'customer mean transaction value over the last 1000 days', or the equivalent rolling window features for min, max and sum (excluding the current row in each calculation). The following shows the desired values of the feature, calculated with the pandas.Series.rolling function and printed from the resulting pandas dataframe:
                            customer_id     amount  mean_trxn_amount_last_1000_days
timestamp
2018-02-08 12:04:36.899422            1  45.926158                              NaN
2019-04-05 07:40:17.873746            1  47.360045                        45.926158
2019-07-27 04:37:48.060949            1  23.325755                        46.643101
2017-05-18 15:46:41.654809            2  25.473693                              NaN
2018-08-08 03:59:05.791552            2  34.859368                        25.473693
2019-01-02 02:44:35.208450            2   5.275328                        30.166530
2020-03-06 09:45:29.866731            2  35.656305                        20.067348
2020-05-28 20:19:08.593375            2  23.237157                        25.263667
I have not found any documentation for similar functionality in Beam - is such functionality available? If not, am I misunderstanding the intended scope of what Beam is meant to provide, or is this sort of functionality likely to be available in the future? Thanks.

You can make use of windowing as you have extracted the timestamps in your sample code.
Fixed windows:
"The simplest form of windowing is using fixed time windows: given a timestamped PCollection which might be continuously updating, each window might capture (for example) all elements with timestamps that fall into a 30 second interval."
Sliding windows:
"A sliding time window also represents time intervals in the data stream; however, sliding time windows can overlap. For example, each window might capture 60 seconds worth of data, but a new window starts every 30 seconds. The frequency with which sliding windows begin is called the period. Therefore, our example would have a window duration of 60 seconds and a period of 30 seconds."
Apply the window, then either use the built-in combiners for min, max, sum and so on, or write your own combiner.
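To make the target of such a combiner concrete, here is a minimal plain-Python sketch (not Beam API) of the "mean over the previous 1000 days, excluding the current row" feature for a single customer. It assumes that one customer's transactions have already been grouped together, for example by a group-by-key step, and that they fit in memory. The name `trailing_means` and the boundary rule (a transaction exactly 1000 days old still counts) are choices made for this example, not something taken from the question.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"  # timestamp format used in the sample CSV


def trailing_means(transactions, days=1000):
    """For one customer's (timestamp_string, amount) pairs, return a list of
    (datetime, amount, mean of earlier amounts within `days` days).
    The current row is excluded; the mean is None when nothing precedes it."""
    rows = sorted((datetime.strptime(ts, FMT), amount) for ts, amount in transactions)
    horizon = timedelta(days=days)
    features = []
    for i, (when, amount) in enumerate(rows):
        # Only strictly earlier rows that fall inside the look-back horizon.
        prior = [a for w, a in rows[:i] if when - w <= horizon]
        features.append((when, amount, sum(prior) / len(prior) if prior else None))
    return features


# Customer 2 from the sample data in the question.
customer_2 = [
    ("2017-05-18 15:46:41.654809", 25.47369262400646),
    ("2018-08-08 03:59:05.791552", 34.859367944028875),
    ("2019-01-02 02:44:35.208450", 5.2753275435507705),
    ("2020-03-06 09:45:29.866731", 35.656304542140404),
    ("2020-05-28 20:19:08.593375", 23.23715711587539),
]

for when, amount, mean in trailing_means(customer_2):
    print(when, round(amount, 6), None if mean is None else round(mean, 6))
```

The same loop could compute min, max or sum by swapping the aggregation applied to `prior`. Note that a sliding window assigns elements to fixed, overlapping intervals, so it approximates this per-row look-back rather than reproducing it exactly.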

Related

How do I create new columns containing lead values generated from another column in Pyspark?

The following code pulls down daily oil prices (dcoilwtico), resamples the daily figures to monthly, calculates the 12-month (i.e. year over year percent) change and finally contains a loop to shift the YoY percent change ahead 1 month (dcoilwtico_1), 2 months (dcoilwtico_2) all the way out to 12 months (dcoilwtico_12) as new columns:
import datetime
import pandas as pd
import pandas_datareader as pdr
start = datetime.datetime(2016, 1, 1)
end = datetime.datetime(2022, 12, 1)
#1. Get historic data
df_fred_daily = pdr.DataReader(['DCOILWTICO'],'fred', start, end).dropna().resample('M').mean() # Pull daily, remove NaN and collapse from daily to monthly
df_fred_daily.columns= df_fred_daily.columns.str.lower()
#2. Expand df range: index, column names
index_fred = pd.date_range('2022-12-31', periods=13, freq='M')
columns_fred_daily = df_fred_daily.columns.to_list()
#3. Append history + empty df
df_fred_daily_forecast = pd.DataFrame(index=index_fred, columns=columns_fred_daily)
df_fred_test_daily=pd.concat([df_fred_daily, df_fred_daily_forecast])
#4. New df, calculate yoy percent change for each commodity
df_fred_test_daily_yoy= ((df_fred_test_daily - df_fred_test_daily.shift(12))/df_fred_test_daily.shift(12))*100
#5. Extend each variable as a series from 1 to 12 months
for col in df_fred_test_daily_yoy.columns:
    for i in range(1,13):
        df_fred_test_daily_yoy["%s_%s"%(col,i)] = df_fred_test_daily_yoy[col].shift(i)
df_fred_test_daily_yoy.tail(18)
And produces the following df:
Question: My real world example contains hundreds of columns and I would like to generate these same results using Pyspark.
How would this be coded using Pyspark?
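As your pandas code already works, I would use Koalas, a pandas-style API on top of Spark. You just need to install it from https://pypi.org/project/koalas/ (in Spark 3.2 and later the same API ships with PySpark as pyspark.pandas).
Here is a simple example: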
import databricks.koalas as ks
import pandas as pd
pdf = pd.DataFrame({'x':range(3), 'y':['a','b','b'], 'z':['a','b','b']})
# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)
# Rename the columns
df.columns = ['x', 'y', 'z1']
# Do some operations in place:
df['x2'] = df.x * df.x

Pyspark write stream one column at a time

The source .csv has 414 columns each with a new date:
The count increases by the total number of COVID deaths up to that date.
I want to display in a Databricks dashboard a stream which will increment up as the total deaths to date increases. Iterating through the date columns from left to right for 412 days. I will insert a pause on the stream after each day, then ingest the next day's results. Displaying the total by state as it increments up with each day.
So far:
df = spark.read.option("header", "true").csv("/databricks-datasets/COVID/USAFacts/covid_deaths_usafacts.csv")
This initial df has 418 columns and I have changed all of the day columns to IntegerType; keeping only the State and County columns as string.
from pyspark.sql import functions as F
for col in temp_df.columns:
    temp_df = temp_df.withColumn(
        col,
        F.col(col).cast("integer")
    )
and then
from pyspark.sql.functions import col
temp_df.withColumn("County Name",col("County Name").cast('integer')).withColumn("State",col("State").cast('integer'))
Then I use df.schema to get the schema and do a second ingest of the .csv, this time with the schema defined. But my next challenge is the most difficult, to stream in the results one column at a time.
Or can I simply PIVOT ? If yes, then like this?
pivotDF = df.groupBy("State").pivot("County", countyFIPS)

Cumulative function in spark scala

I have tried the following to calculate a cumulative value, but when several rows share the same date, all of their values are added into the cumulative field at once. Can someone suggest a solution? This is similar to this question:
val windowval = (Window.partitionBy($"userID").orderBy($"lastModified")
.rangeBetween(Window.unboundedPreceding, 0))
val df_w_cumsum = ms1_userlogRewards.withColumn("totalRewards", sum($"noOfJumps").over(windowval)).orderBy($"lastModified".asc)
df_w_cumsum.filter($"batchType".isNull).filter($"userID"==="355163").select($"userID", $"noOfJumps", $"totalRewards",$"lastModified").show()
Note that your very first totalRewards=147 is the sum of the previous value 49 + all the values with timestamp "2019-08-07 18:25:06": 49 + (36 + 0 + 60 + 2) = 147.
The first option would be to aggregate all the values with the same timestamp first, e.g. groupBy($"userId", $"lastModified").agg(sum($"noOfJumps").as("noOfJumps")) (or something like that), and then run your aggregate sum. This will remove duplicate timestamps altogether.
The second option is to use row_number to define an order among rows with the same lastModified field first, and then run your aggregate sum with .orderBy($"lastModified", $"row_number") (or something like that). This should keep all records and give you partial sums along the way: totalRewards = 49 -> 85 -> 85 -> 145 -> 147 (or something similar, depending on the order defined by row_number).
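The difference between the two behaviours can be sketched in plain Python, independent of Spark. The values 49, 36, 0, 60 and 2 and the shared timestamp come from the note above; the first timestamp and the function names `range_cumsum` and `row_cumsum` are made up for the illustration, and input position stands in for row_number as the tie-breaker.

```python
from itertools import groupby

# (lastModified, noOfJumps); the first timestamp is hypothetical.
rows = [
    ("2019-08-07 18:20:00", 49),
    ("2019-08-07 18:25:06", 36),
    ("2019-08-07 18:25:06", 0),
    ("2019-08-07 18:25:06", 60),
    ("2019-08-07 18:25:06", 2),
]


def range_cumsum(rows):
    """Range-style frame: rows with the same ordering value are peers, so
    each of them receives the total including the whole group of ties."""
    out, total = [], 0
    ordered = sorted(rows, key=lambda r: r[0])
    for _, group in groupby(ordered, key=lambda r: r[0]):
        group = list(group)
        total += sum(value for _, value in group)
        out.extend([total] * len(group))
    return out


def row_cumsum(rows):
    """Row-style running total: a tie-breaker (input position here) gives
    every row its own place in the order and its own partial sum."""
    ordered = sorted(enumerate(rows), key=lambda p: (p[1][0], p[0]))
    out, total = [], 0
    for _, (_, value) in ordered:
        total += value
        out.append(total)
    return out


print(range_cumsum(rows))  # [49, 147, 147, 147, 147]
print(row_cumsum(rows))    # [49, 85, 85, 145, 147]
```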
I think you want to sum by userID and timestamp.
So you need to partition by userID and date, and use a window function to sum, like the following:
import org.apache.spark.sql.functions.{col, sum}
import org.apache.spark.sql.expressions.Window
val window = Window.partitionBy("userID", "lastModified")
// .over is applied to the aggregate column returned by sum(...)
df.withColumn("cumulativeSum", sum(col("noOfJumps")).over(window))

Postgres Function: how to return the first full set of data that occurs after specified date/time

I have a requirement to extract rows of data, but only if all said rows make a full set. We have a sequence table that is updated every minute, with data for 80 bins. We need to know the status of bins 1 thru 80 every minute as part of our production process.
I am generating a new report (a Postgres function) that needs to take a snapshot at roughly 00:01:00 AM (i.e. 1 minute past midnight). Initially I thought this would be an easy task: just grab the first 80 rows of data that occur at or after this time. However, depending on network activity and industrial computer priorities, the table is not reliably updated at exactly 00:01:00 AM, or at any exact minute for that matter. Updates can occur milliseconds or even seconds later, and take 500 ms to 800 ms to reach the database. Sometimes a given minute can be missing altogether (production processes take precedence over data capture, and the sequence data is not super critical anyway).
My thinking is it would be more reliable to look for the first complete set of data anytime from 00:01:00AM onwards. So effectively, I have a table that looks a bit like this:
Apologies, I know you prefer for images of this manner to not be pasted in this manner, but I could not figure out how to create a textual table like this here (carriage return or Enter button is ignored!)
Basically, the above table is typical, but 1st minute is not guaranteed, and for that matter, I would not be 100% confident that all 80 bins are logged for a given minute. Hence my question: how to return the first complete set of data, where all 80 bins (rows) have been captured for a particular minute?
Thinking about it, I could do some sort of row count in the function, ensuring there are 80 rows for a given minute, but this seems less reliable. I would like to know for sure that, among the rows of a given minute, bin 1 is represented, bin 2, bin 3...
Ultimately a call to this function will supply a min/max date/time and that period of time will be checked for the first available minute with a full set of bins data.
I am reasonably sure this will involve a window function, as all rows have to be assessed prior to data extraction. I've used window functions a few times now, but I am still a green newbie compared to others here, so help is appreciated.
My final code, thanks to help from @klin:
StartTime = DATE_TRUNC('minute', tme1);
EndTime = DATE_TRUNC('day', tme1) + '23 hours'::interval;
SELECT "BinSequence".*
FROM "BinSequence"
JOIN(
    SELECT "binMinute" AS binminute, count("binMinute")
    FROM "BinSequence"
    WHERE ("binTime" >= StartTime) AND ("binTime" < EndTime)
    GROUP BY 1
    HAVING COUNT (DISTINCT "binBinNo") = 80 -- verifies that each and every bin is represented in returned data
) theseTuplesOnly
ON theseTuplesOnly.binminute = "binMinute"
WHERE ("binTime" >= StartTime) AND ("binTime" < EndTime)
GROUP BY 1
ORDER BY 1
LIMIT 80
Use the aggregate function count(*) grouping data by minutes (date_trunc('minute', datestamp) gives full minutes from datestamp), e.g.:
create table bins(datestamp time, bin int, param text);
insert into bins values
('00:01:10', 1, 'a'),
('00:01:20', 2, 'b'),
('00:01:30', 3, 'c'),
('00:01:40', 4, 'd'),
('00:02:10', 3, 'e'),
('00:03:10', 2, 'f'),
('00:03:10', 3, 'g'),
('00:03:10', 4, 'h');
select date_trunc('minute', datestamp) as minute, count(bin)
from bins
group by 1
order by 1
  minute  | count
----------+-------
 00:01:00 |     4
 00:02:00 |     1
 00:03:00 |     3
(3 rows)
If you are not sure that all bins are unique in consecutive minutes, use distinct (this will make the query slower):
select date_trunc('minute', datestamp) as minute, count(distinct bin)
...
You cannot select counts aggregated by minute and all columns of the table in a single simple select. If you want to do that, you should join a derived table, use the IN operator, or use a window function. A join seems to be the simplest:
select b.*, count
from bins b
join (
select date_trunc('minute', datestamp) as minute, count(bin)
from bins
group by 1
having count(bin) = 4
) s
on date_trunc('minute', datestamp) = minute
order by 1;
 datestamp | bin | param | count
-----------+-----+-------+-------
 00:01:10  |   1 | a     |     4
 00:01:20  |   2 | b     |     4
 00:01:30  |   3 | c     |     4
 00:01:40  |   4 | d     |     4
(4 rows)
Note also how to use having() to filter results in the above query.
You can test the query here.
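The underlying rule, "take the earliest minute in which every expected bin appears at least once", can also be sketched outside the database. This is a plain-Python illustration using the sample rows from the answer; the calendar date and the function name `first_complete_minute` are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def first_complete_minute(readings, expected_bins):
    """Return (minute, rows) for the earliest minute in which every bin in
    expected_bins was logged at least once, or None if no minute qualifies."""
    by_minute = defaultdict(list)
    for when, bin_no in readings:
        # Truncate to the minute, like date_trunc('minute', ...).
        by_minute[when.replace(second=0, microsecond=0)].append((when, bin_no))
    wanted = set(expected_bins)
    for minute in sorted(by_minute):
        seen = {bin_no for _, bin_no in by_minute[minute]}
        if seen >= wanted:  # distinct bins cover the full set
            return minute, by_minute[minute]
    return None


# Sample rows from the answer, placed on a hypothetical date.
readings = [
    (datetime(2020, 1, 1, 0, 1, 10), 1),
    (datetime(2020, 1, 1, 0, 1, 20), 2),
    (datetime(2020, 1, 1, 0, 1, 30), 3),
    (datetime(2020, 1, 1, 0, 1, 40), 4),
    (datetime(2020, 1, 1, 0, 2, 10), 3),
    (datetime(2020, 1, 1, 0, 3, 10), 2),
    (datetime(2020, 1, 1, 0, 3, 10), 3),
    (datetime(2020, 1, 1, 0, 3, 10), 4),
]

print(first_complete_minute(readings, range(1, 5)))
```

Comparing sets of distinct bin numbers, rather than counting rows, is the counterpart of count(distinct bin): a duplicated bin cannot mask a missing one.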

Min value with GROUP BY in Power BI Desktop

id  datetime             new_column           datetime_rankx
1   12.01.2015 18:10:10  12.01.2015 18:10:10  1
2   03.12.2014 14:44:57  03.12.2014 14:44:57  1
2   21.11.2015 11:11:11  03.12.2014 14:44:57  2
3   01.01.2011 12:12:12  01.01.2011 12:12:12  1
3   02.02.2012 13:13:13  01.01.2011 12:12:12  2
3   03.03.2013 14:14:14  01.01.2011 12:12:12  3
I want to add a new column that holds, for each row, the minimum datetime value within that row's id group.
How could I do it in Power BI desktop using DAX query?
Use this expression:
NewColumn =
CALCULATE(
    MIN(Table[datetime]),
    FILTER(Table, Table[id] = EARLIER(Table[id]))
)
In Power BI using a table with your data it will produce this:
UPDATE: Explanation and EARLIER function usage.
Basically, the EARLIER function gives you access to values from a different (outer) row context.
A calculated column is evaluated once for every row of the table, which gives the expression a row context for the current row. The FILTER function then iterates over the whole table again and evaluates every row against the filter condition, which creates a second, inner row context.
So there are two row contexts: the outer one of the row being calculated and the inner one created by FILTER. Inside FILTER, EARLIER reaches back to the outer row context. In our case, for every row in the outer context, FILTER returns the set of rows whose id matches the id of the current outer row.
If you have a programming background, it may help to think of it as a nested loop.
Hopefully this Python code conveys the main idea:
rows = [(1, '2015-01-12'), (2, '2014-12-03'), (2, '2015-11-21')]  # (id, datetime) sample rows
for outer_id, outer_dt in rows:  # outer row context: the row being calculated
    # this line is what FILTER and EARLIER do: keep inner rows with the outer row's id
    matching = [inner_dt for inner_id, inner_dt in rows if inner_id == outer_id]
    # Calculate the min datetime using the filtered rows
    print(outer_id, outer_dt, min(matching))
UPDATE 2: Adding a ranking column.
To get the desired rank you can use this expression:
RankColumn =
RANKX(
    CALCULATETABLE(Table, ALLEXCEPT(Table, Table[id])),
    Table[datetime],
    Table[datetime],
    1
)
This is the table with the rank column:
Let me know if this helps.
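For comparison, the two derived columns from the question can be reproduced with a short plain-Python sketch. It only illustrates the logic and is not how Power BI evaluates DAX. It assumes the dates are in day.month.year order, and that tied datetimes within an id would share a rank.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%d.%m.%Y %H:%M:%S"  # assumed day.month.year order

rows = [
    (1, "12.01.2015 18:10:10"),
    (2, "03.12.2014 14:44:57"),
    (2, "21.11.2015 11:11:11"),
    (3, "01.01.2011 12:12:12"),
    (3, "02.02.2012 13:13:13"),
    (3, "03.03.2013 14:14:14"),
]

# Collect the datetimes of each id group once.
by_id = defaultdict(list)
for row_id, text in rows:
    by_id[row_id].append(datetime.strptime(text, FMT))

result = []
for row_id, text in rows:
    when = datetime.strptime(text, FMT)
    group = by_id[row_id]
    earliest = min(group)  # new_column: minimum datetime within the id
    # datetime_rankx: ascending rank, 1 + number of earlier rows in the group
    rank = 1 + sum(1 for other in group if other < when)
    result.append((row_id, text, earliest.strftime(FMT), rank))

for line in result:
    print(*line)
```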