Pandas groupby apply with multiindex performance issues - categories

import pandas as pd
import numpy as np
from collections import OrderedDict
import gc
import datetime
df = pd.DataFrame(np.random.rand(10000000, 5), columns=['A', 'B', 'C', 'D', 'E'])
# Bucket A and B into 0.25-wide bins, keep string versions (A, B), index copies (ix1, ix2)
# and categorical versions (A1, B1)
df['A'] = 0.25 * ((df['A'] / 0.25).astype(int))
df['B'] = 0.25 * ((df['B'] / 0.25).astype(int))
df['A'] = df['A'].astype(str)
df['B'] = df['B'].astype(str)
df['ix1'] = df['A']
df['ix2'] = df['B']
df['A1'] = df['A'].astype('category')
df['B1'] = df['B'].astype('category')
gc.collect()
Question 1:
groupby followed by apply takes much longer than DataFrame.count, even though they do effectively the same thing. How do I optimize here?
This takes ~17 seconds:
df.groupby(['A', 'B']).apply(genSummary)
This takes only 3 seconds:
df.groupby(['A', 'B']).count()
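The built-in reductions (count, sum, mean, ...) run on an optimized Cython path and never call back into Python once per group, which is where most of the 17 seconds in apply goes. A sketch, not from the original post, of keeping the counting and summing work on that path using named aggregation (pandas >= 0.25); C_nonzero is a helper column introduced only for illustration:
# Express the per-group work as built-in reductions so pandas never has to
# call a Python function once per group
summary = (
    df.assign(C_nonzero=df['C'].ne(0))        # flag non-zero C values up front
      .groupby(['A', 'B'])
      .agg(Counts=('C_nonzero', 'sum'),       # count of non-zero C values per group
           Sum=('D', 'sum'))
)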
Question 2:
I need to apply custom functions to the groupby objects. After setting a MultiIndex, grouping by the string columns is still slower than grouping by the index levels or by categories (timings below)...
def genSummary(group):
    return pd.Series(OrderedDict([
        ('Counts', np.count_nonzero(group['C'])),
        ('Sum', np.sum(group['D'])),
        ('Wavg', np.ma.average(group['E'], weights=group['C'])),
        # renamed to 'Wavg2': a duplicate 'Wavg' key would silently overwrite the entry above
        ('Wavg2', np.ma.average(group['E'] * (group['C'] > 0.5), weights=group['C'] * (group['C'] > 0))),
    ]))
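The weighted averages can also be pushed onto the built-in aggregation path by precomputing the numerator and denominator columns once, aggregating with sum, and dividing afterwards. The sketch below reproduces genSummary's weighted averages but is not code from the original post; combine it with the Counts/Sum aggregations shown earlier for the full summary:
# Precompute weighted numerators/denominators so the groupby only has to sum
tmp = df.assign(
    EC=df['E'] * df['C'],                                       # numerator for Wavg
    EC2=df['E'] * (df['C'] > 0.5) * df['C'] * (df['C'] > 0),    # numerator for Wavg2
    C2=df['C'] * (df['C'] > 0),                                 # denominator for Wavg2
)
g = tmp.groupby(['A', 'B']).agg(
    EC=('EC', 'sum'), C=('C', 'sum'),
    EC2=('EC2', 'sum'), C2=('C2', 'sum'),
)
g['Wavg'] = g['EC'] / g['C']
g['Wavg2'] = g['EC2'] / g['C2']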
1 group by string takes ~8.1 seconds
df.groupby(['A', 'B']).apply(genSummary)
2 group by category takes ~6.3 seconds
df.groupby(['A1', 'B1']).apply(genSummary)
df.sort_values(['ix1', 'ix2'], inplace=True)
gc.collect()
df.set_index(['ix1', 'ix2'], inplace=True)
gc.collect()
3 set multiindex, group by string takes ~7.2 seconds
df.groupby(['A', 'B']).apply(genSummary)
4 set multiindex, group by index takes ~5.0 seconds
df.groupby(level=[0,1]).apply(genSummary)
5 set multiindex, group by category takes ~4.6 seconds
df.groupby(['A1', 'B1']).apply(genSummary)
df = df.reset_index()
gc.collect()
6 reset_index, group by category takes ~4.9 seconds
df.groupby(['A1', 'B1']).apply(genSummary)
7 reset_index, group by string takes ~6.2 seconds
df.groupby(['A', 'B']).apply(genSummary)
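One more knob worth knowing about when grouping by categoricals (an addition, not something measured in the post): by default pandas produces a row for every combination of categories, including combinations that never occur in the data. Passing observed=True restricts the result to observed combinations, which can save time and memory when the categories are sparse:
# Only materialize (A1, B1) combinations that actually occur in the data
df.groupby(['A1', 'B1'], observed=True).apply(genSummary)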

Related

How do I create new columns containing lead values generated from another column in Pyspark?

The following code pulls down daily oil prices (dcoilwtico), resamples the daily figures to monthly, calculates the 12-month (i.e. year-over-year percent) change and finally loops to shift the YoY percent change ahead 1 month (dcoilwtico_1), 2 months (dcoilwtico_2), all the way out to 12 months (dcoilwtico_12) as new columns:
import datetime
import pandas as pd
import pandas_datareader as pdr
start = datetime.datetime(2016, 1, 1)
end = datetime.datetime(2022, 12, 1)
#1. Get historic data
df_fred_daily = pdr.DataReader(['DCOILWTICO'], 'fred', start, end).dropna().resample('M').mean() # Pull daily, remove NaN and collapse from daily to monthly
df_fred_daily.columns = df_fred_daily.columns.str.lower()
#2. Expand df range: index, column names
index_fred = pd.date_range('2022-12-31', periods=13, freq='M')
columns_fred_daily = df_fred_daily.columns.to_list()
#3. Append history + empty df
df_fred_daily_forecast = pd.DataFrame(index=index_fred, columns=columns_fred_daily)
df_fred_test_daily = pd.concat([df_fred_daily, df_fred_daily_forecast])
#4. New df, calculate yoy percent change for each commodity
df_fred_test_daily_yoy = ((df_fred_test_daily - df_fred_test_daily.shift(12)) / df_fred_test_daily.shift(12)) * 100
#5. Extend each variable as a series from 1 to 12 months
for col in df_fred_test_daily_yoy.columns:
    for i in range(1, 13):
        df_fred_test_daily_yoy["%s_%s" % (col, i)] = df_fred_test_daily_yoy[col].shift(i)
df_fred_test_daily_yoy.tail(18)
And produces the following df:
Question: My real world example contains hundreds of columns and I would like to generate these same results using Pyspark.
How would this be coded using Pyspark?
Since your code is already written in pandas, I would use Koalas, a pandas API on top of Spark. You just need to install it: https://pypi.org/project/koalas/
See this simple example:
import databricks.koalas as ks
import pandas as pd
pdf = pd.DataFrame({'x':range(3), 'y':['a','b','b'], 'z':['a','b','b']})
# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)
# Rename the columns
df.columns = ['x', 'y', 'z1']
# Do some operations in place:
df['x2'] = df.x * df.x
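If plain PySpark is preferred over Koalas, the same shift loop can be expressed with window functions. Below is a minimal sketch, assuming the monthly YoY data has already been loaded into a Spark DataFrame called sdf_yoy with a date column plus one column per commodity (both names are hypothetical):
from pyspark.sql import functions as F, Window

# Single time series, so order by date without partitioning
w = Window.orderBy("date")
value_cols = [c for c in sdf_yoy.columns if c != "date"]
for col_name in value_cols:
    for i in range(1, 13):
        # F.lag(col, i) reproduces pandas' Series.shift(i) on a sorted series
        sdf_yoy = sdf_yoy.withColumn("%s_%s" % (col_name, i), F.lag(col_name, i).over(w))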

Pyspark write stream one column at a time

The source .csv has 414 columns each with a new date:
The count increases by the total number of COVID deaths up to that date.
I want to display in a Databricks dashboard a stream which increments as the total deaths to date increase, iterating through the date columns from left to right for 412 days. I will insert a pause on the stream after each day, then ingest the next day's results, displaying the total by state as it increments with each day.
So far:
df = spark.read.option("header", "true").csv("/databricks-datasets/COVID/USAFacts/covid_deaths_usafacts.csv")
This initial df has 418 columns and I have changed all of the day columns to IntegerType; keeping only the State and County columns as string.
from pyspark.sql import functions as F
for col in temp_df.columns:
    temp_df = temp_df.withColumn(
        col,
        F.col(col).cast("integer")
    )
and then
from pyspark.sql.functions import col
temp_df.withColumn("County Name",col("County Name").cast('integer')).withColumn("State",col("State").cast('integer'))
Then I use df.schema to get the schema and do a second ingest of the .csv, this time with the schema defined. But my next challenge is the most difficult: streaming in the results one column at a time.
Or can I simply PIVOT? If yes, then like this?
pivotDF = df.groupBy("State").pivot("County", countyFIPS)
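One point worth noting about that last line (not addressed in the thread): groupBy().pivot() only returns a GroupedData object, so an aggregation is still required to get a DataFrame back. A minimal sketch, where df_long, "County" and "deaths" are hypothetical names for a long-format version of the data:
from pyspark.sql import functions as F

# pivot() must be followed by an aggregation such as sum() to produce a DataFrame
pivotDF = df_long.groupBy("State").pivot("County").agg(F.sum("deaths"))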

PySpark Error using partition over a ranked column

I have two Spark DataFrames. The first one contains information related to Events, as follows:
Id  User_id  Date
1   1        2021-08-15
2   2        2020-03-10
The second DataFrame contains information related to previous Purchases, as below:
Id  User_id  Date
1   1        2021-07-15
2   1        2021-07-10
3   1        2021-04-12
4   2        2020-02-10
What I would like to know is how to get the number of purchases made by each user in the 90 days prior to the event date.
The code I'm using is:
(events.join(purchase,
on = [events.User_id == purchase.User_id,
events.Date >= purchase.Date],
how = "left")
.withColumn('rank_test', rank().over(W.partitionBy(purchase['User_id']).orderBy(col("Date").desc())))
.withColumn('is90days', when(floor((events["Date"].cast('long') - purchase["Date"].cast('long'))/86400) <= 90, 1).otherwise(0))
.where(col('is90days') == 1)
.withColumn('maxPurchase', max('rank_test').over(W.partitionBy(events['ID'])))
.where(col('rank_test') == col('maxPurchase'))
)
But I'm getting the following error:
AttributeError: 'str' object has no attribute 'over'
What I was expecting is a table as follows:
Id  User_id  Date        qtyPurchasePast90days
1   1        2021-08-15  2
2   2        2020-03-10  1
I appreciate your time in helping me!
Regards
In your code (line 8, the maxPurchase column), Python's built-in max is being used instead of Spark's max function. The main reason is that you (and many, many others) import Spark functions in a way that is not recommended:
# DON'T do this
from pyspark.sql.functions import *
max('rank_test')   # Python's built-in max over the string, returns 't'
# DO this
from pyspark.sql import functions as F
F.max('rank_test') # Column<'max(rank_test)'>
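For completeness, here is a sketch (not part of the original answer) of the same chain with every Spark function referenced through the F prefix; rank, when, floor, col and max all come from pyspark.sql.functions:
from pyspark.sql import functions as F
from pyspark.sql import Window as W

result = (
    events.join(purchase,
                on=[events.User_id == purchase.User_id,
                    events.Date >= purchase.Date],
                how="left")
    .withColumn('rank_test', F.rank().over(W.partitionBy(purchase['User_id']).orderBy(F.col("Date").desc())))
    .withColumn('is90days', F.when(F.floor((events["Date"].cast('long') - purchase["Date"].cast('long'))/86400) <= 90, 1).otherwise(0))
    .where(F.col('is90days') == 1)
    .withColumn('maxPurchase', F.max('rank_test').over(W.partitionBy(events['ID'])))
    .where(F.col('rank_test') == F.col('maxPurchase'))
)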

Structured spark streaming leftOuter joins behaves like inner join

I'm trying a Spark Structured Streaming stream-stream join, and my left outer join behaves exactly the same as an inner join.
Using spark version 2.4.2 and Scala version 2.12.8, Eclipse OpenJ9 VM, 1.8.0_252
Here is what I'm trying to do:
Create a rate stream which generates 1 row per second.
Create an Employee and a Dept stream out of it.
The Employee stream's departmentId field is the rate value multiplied by 2, and the Dept stream's Id field is the rate value multiplied by 3.
The purpose of doing this is to have two streams with some common and some uncommon id values.
Do a leftOuter stream-stream join with a time constraint of 30 seconds and the dept stream on the left side of the join.
Expectation:
After the 30-second time constraint, for unmatched rows, I should see null on the right side of the join.
What's happening:
I only see rows where there was a match between ids, and no unmatched rows.
Code (running in spark-shell):
import java.sql.Timestamp
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
case class RateData(timestamp: Timestamp, value: Long)
// create rate source with 1 row per second.
val rateSource = spark.readStream.format("rate").option("rowsPerSecond", 1).option("numPartitions", 1).option("rampUpTime", 1).load()
import spark.implicits._
val rateSourceData = rateSource.as[RateData]
// employee stream: departmentId is 2 times the rate value
val employeeStreamDS = rateSourceData.withColumn("firstName", concat(lit("firstName"),rateSourceData.col("value")*2)).withColumn("departmentId", lit(floor(rateSourceData.col("value")*2))).withColumnRenamed("timestamp", "empTimestamp").withWatermark("empTimestamp", "10 seconds")
// dept stream: Id is 3 times the rate value
val departmentStreamDS = rateSourceData.withColumn("name", concat(lit("name"),floor(rateSourceData.col("value")*3))).withColumn("Id", lit(floor(rateSourceData.col("value")*3))).drop("value").withColumnRenamed("timestamp", "depTimestamp")
// watermark - 10s and time constraint is 30 secs on employee stream.
val joinedDS = departmentStreamDS.join(employeeStreamDS, expr(""" id = departmentId AND empTimestamp >= depTimestamp AND empTimestamp <= depTimestamp + interval 30 seconds """), "leftOuter")
val q = joinedDS.writeStream.format("parquet").trigger(Trigger.ProcessingTime("60 seconds")).option("checkpointLocation", "checkpoint").option("path", "rate-output").start
I queried the output table after 10 minutes and only found 31 matching rows, which is the same as the inner join output.
val df = spark.read.parquet("rate-output")
df.count
res0: Long = 31
df.agg(min("departmentId"), max("departmentId")).show
+-----------------+-----------------+
|min(departmentId)|max(departmentId)|
+-----------------+-----------------+
| 0| 180|
+-----------------+-----------------+
Explanation of the output:
In the employeeStreamDS stream, the departmentId field value is 2 times the rate value, so it is a multiple of two.
In the departmentStreamDS stream, the Id field is 3 times the rate stream value, so it is a multiple of 3.
So there would be a match of departmentId = Id for every 6, because LCM(2, 3) = 6.
That would happen until there is a difference of 30 seconds between those streams (the join time constraint).
I would expect that after 30 seconds I would get null values for the unmatched dept stream Ids (3, 9, 15, ...) and so on.
I hope I'm explaining it well enough.
So the question is about left-outer join behavior for Spark Structured Streaming.
From my understanding, and indeed according to https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-stream-joins, you need to apply watermarks on the event-time columns of both streams, e.g.:
val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
val clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")
...
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """),
  joinType = "leftOuter"  // can be "inner", "leftOuter", "rightOuter"
)
You have only one watermark defined (on the employee stream); for the outer join to emit the unmatched rows, the department stream also needs a watermark on its event-time column (depTimestamp).

Feature engineering of rolling windows with apache beam

I have been able to read in the following data representing customer transactions as csv with Beam (Python SDK).
timestamp,customer_id,amount
2018-02-08 12:04:36.899422,1,45.92615814813004
2019-04-05 07:40:17.873746,1,47.360044568200514
2019-07-27 04:37:48.060949,1,23.325754816230106
2017-05-18 15:46:41.654809,2,25.47369262400646
2018-08-08 03:59:05.791552,2,34.859367944028875
2019-01-02 02:44:35.208450,2,5.2753275435507705
2020-03-06 09:45:29.866731,2,35.656304542140404
2020-05-28 20:19:08.593375,2,23.23715711587539
The csv is being read in as follows:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.textio import ReadFromText
import datetime

class Split(beam.DoFn):
    def process(self, element):
        timestamp, customer_id, amount = element.split(",")
        return [{
            'timestamp': timestamp,
            'customer': int(customer_id),
            'amount': float(amount)
        }]

options = PipelineOptions()
with beam.Pipeline(options=options) as p:
    rows = (
        p |
        ReadFromText('../data/sample_trxns.csv', skip_header_lines=1) |
        beam.ParDo(Split())
    )
class UnixTime(beam.DoFn):
    def process(self, element):
        """
        Returns a list of tuples containing customer and amount
        """
        unix_time = datetime.datetime.strptime(
            element['timestamp'],
            "%Y-%m-%d %H:%M:%S.%f"
        ).timestamp()
        return [{
            'timestamp': unix_time,
            'customer': element['customer'],
            'amount': element['amount']
        }]

class AddTimestampDoFn(beam.DoFn):
    def process(self, element):
        unix_timestamp = element['timestamp']
        # Wrap and emit the current entry and new timestamp in a
        # TimestampedValue.
        yield beam.window.TimestampedValue(element, unix_timestamp)

timed_rows = (
    rows |
    beam.ParDo(UnixTime()) |
    beam.ParDo(AddTimestampDoFn())
)
However, with Beam I have been unable to derive rolling window features such as 'customer mean transaction value over the last 1000 days', or the equivalent rolling window features for min, max and sum (excluding the current row in each calculation). The output below shows the desired values of the feature, calculated with the pandas.Series.rolling function and printed as a pandas DataFrame:
customer_id amount mean_trxn_amount_last_1000_days
timestamp
2018-02-08 12:04:36.899422 1 45.926158 NaN
2019-04-05 07:40:17.873746 1 47.360045 45.926158
2019-07-27 04:37:48.060949 1 23.325755 46.643101
2017-05-18 15:46:41.654809 2 25.473693 NaN
2018-08-08 03:59:05.791552 2 34.859368 25.473693
2019-01-02 02:44:35.208450 2 5.275328 30.166530
2020-03-06 09:45:29.866731 2 35.656305 20.067348
2020-05-28 20:19:08.593375 2 23.237157 25.263667
I have not found any documentation for similar functionality in Beam - is such functionality available? If not, am I misunderstanding the intended scope of what Beam is meant to provide, or is this sort of functionality likely to be available in the future? Thanks.
You can make use of windowing, since you have already extracted the timestamps in your sample code.
Fixed windows:
"The simplest form of windowing is using fixed time windows: given a timestamped PCollection which might be continuously updating, each window might capture (for example) all elements with timestamps that fall into a 30 second interval."
Sliding windows:
"A sliding time window also represents time intervals in the data stream; however, sliding time windows can overlap. For example, each window might capture 60 seconds worth of data, but a new window starts every 30 seconds. The frequency with which sliding windows begin is called the period. Therefore, our example would have a window duration of 60 seconds and a period of 30 seconds."
Apply the window, then make use of either the built-in combiners for Min/Max/Sum/Mean etc., or create your own combiner.
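A minimal sketch of that idea, continuing from the timed_rows collection above; the 1000-day window size, the 1-day period and the keying by customer are assumptions made for illustration, and sliding windows give per-window aggregates rather than the exact "all prior rows, excluding the current one" semantics of pandas' rolling:
DAY = 24 * 60 * 60  # window sizes are given in seconds

windowed_means = (
    timed_rows
    # Key each element by customer so the aggregate is computed per customer
    | 'KeyByCustomer' >> beam.Map(lambda e: (e['customer'], e['amount']))
    # Sliding windows: 1000 days long, a new window starting every day
    | 'SlidingWindow' >> beam.WindowInto(beam.window.SlidingWindows(size=1000 * DAY, period=DAY))
    # Built-in mean combiner; Min/Max/Sum work the same way via beam.CombinePerKey
    | 'MeanPerCustomer' >> beam.combiners.Mean.PerKey()
)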