Is it bad to use `GroupBy` multiple times in pyspark?

This is an educational question.
I have a text file containing several records of power consumption of factories, each identified by a unique id. The file contains the following columns
factory_id, city, country, date, consumption
where date is in the format mm/YYYY. I want to compute which countries have fewer than 20 cities (including those with 0) that experienced a decrease in factories' consumption in two consecutive years. By a city's consumption I mean the total yearly consumption of the factories located in that city.
To do this, I used groupBy + agg multiple times, as follows:
import pyspark.sql.functions as F
import pyspark.sql.types as T
df = df.withColumn("year", F.split("Date", "/")[1])
# compute for each city the yearly consumption
df_consump = df.groupBy("Country", "City", "year").agg(
    F.sum("consumption").alias("consumption")
)
@F.udf(returnType=T.IntegerType())
def had_a_decrease(structs):
    structs = sorted(structs, key=lambda s: s.year)
    # return 0 if the list is monotonically growing, 1 otherwise
    cur_cons = structs[0].consumption
    for struct in structs[1:]:
        cons = struct.consumption
        if cons <= cur_cons:
            return 1
        cur_cons = cons
    return 0
df_cons_decrease = df_consump.groupBy("Country", "City").agg(
    # here I collect a list of structs containing (year, consumption),
    # which is needed because collect_list doesn't guarantee the order
    # is respected, so I keep the info on the year to sort this (small)
    # list first in the udf "had_a_decrease" defined above.
    # Eventually this yields a column with a 1 if we had a decrease, 0 otherwise,
    # which I sum afterwards.
    had_a_decrease(F.collect_list(F.struct("year", "consumption"))).alias("had_decrease")
)
df_cons_decrease.groupBy("Country").agg(
    F.sum("had_decrease").alias("num_cities_with_decrease")
).filter("num_cities_with_decrease < 20")\
    .write.csv(outputFolder)
However, I was wondering:
Is this bad practice (e.g. inefficient)?
Are DataFrames better suited than RDDs for this?
Would you recommend a better approach than grouping this many times?

Compare the consumption with the consumption 1 year and 2 years ago by using a Window with the lag function (no udf needed), and then group by.
import pyspark.sql.functions as f
from pyspark.sql import Window

data = [
    [1, 1, 1, '01/2022', 100],
    [1, 1, 1, '01/2021', 90],
    [1, 1, 1, '01/2020', 80],
    [1, 1, 2, '01/2022', 100],
    [1, 1, 2, '01/2021', 110],
    [1, 1, 2, '01/2020', 120]
]
cols = ['factory_id', 'city', 'country', 'date', 'consumption']

df = spark.createDataFrame(data, cols) \
    .withColumn('year', f.split('date', '/')[1])

w = Window.partitionBy('country', 'city').orderBy('year')

df.groupBy('country', 'city', 'year') \
  .agg(f.sum('consumption').alias('consumption')) \
  .withColumn('consumption-1', f.lag('consumption', 1).over(w)) \
  .withColumn('consumption-2', f.lag('consumption', 2).over(w)) \
  .withColumn('is_decreased', f.expr('if(`consumption` < `consumption-1` and `consumption-1` < `consumption-2`, true, false)')) \
  .filter('is_decreased = true') \
  .select('country', 'city').distinct() \
  .groupBy('country').count() \
  .filter('count < 20') \
  .select('country') \
  .show()
+-------+
|country|
+-------+
| 2|
+-------+
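Note that the original question counts countries with fewer than 20 such cities including those with 0; filtering on is_decreased = true before counting drops countries whose cities never decreased. A possible variant (a sketch of my own, reusing df, w and the f alias defined above, and flagging a single year-over-year drop rather than two consecutive drops) keeps those countries:

# flag a year-over-year drop per (country, city, year), then reduce per city and per country
yearly = df.groupBy('country', 'city', 'year') \
    .agg(f.sum('consumption').alias('consumption')) \
    .withColumn('consumption-1', f.lag('consumption', 1).over(w)) \
    .withColumn('is_decreased', f.expr('`consumption` < `consumption-1`'))

yearly.groupBy('country', 'city') \
    .agg(f.max(f.coalesce(f.col('is_decreased').cast('int'), f.lit(0))).alias('had_decrease')) \
    .groupBy('country') \
    .agg(f.sum('had_decrease').alias('num_cities_with_decrease')) \
    .filter('num_cities_with_decrease < 20') \
    .select('country') \
    .show()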

Related

Using a grouped z-score over a rolling window

I would like to calculate a z-score over a bin based on the data of a rolling look-back period.
Example
Today's visitor amount during [9:30-9:35) should be z-score normalized based on the (mean, std) of the last 3 days of visitors that visited during [9:30-9:35).
My current attempts both raise InvalidOperationError. Is there a way in polars to calculate this?
import pandas as pd
import polars as pl

def z_score(col: str, over: str, alias: str):
    # calculate z-score normalized `col` over `over`
    return (
        (pl.col(col) - pl.col(col).over(over).mean()) / pl.col(col).over(over).std()
    ).alias(alias)

df = pl.from_dict(
    {
        "timestamp": pd.date_range("2019-12-02 9:30", "2019-12-02 12:30", freq="30s").union(
            pd.date_range("2019-12-03 9:30", "2019-12-03 12:30", freq="30s")
        ),
        "visitors": [(e % 2) + 1 for e in range(722)]
    }
# 5 minute bins for grouping [9:30-9:35) -> 930
).with_column(
    pl.col("timestamp").dt.truncate(every="5m").dt.strftime("%H%M").cast(pl.Int32).alias("five_minute_bin")
).with_column(
    pl.col("timestamp").dt.truncate(every="3d").alias("daytrunc")
)

# normalize visitor amount for each 5 min bin over the rolling 3 day window using z-score.
# not rolling but also won't work (InvalidOperationError: window expression not allowed in aggregation)
# df.with_column(
#     z_score("visitors", "five_minute_bin", "normalized").over("daytrunc")
# )
# won't work either (InvalidOperationError: window expression not allowed in aggregation)
# df.groupby_rolling(index_column="daytrunc", period="3i").agg(z_score("visitors", "five_minute_bin", "normalized"))
For an example of 4 days of data with four data points per day, lying in two time bins ({0,0}, {0,1}) and ({1,0}, {1,1}):
Input:
Day 0: x_d0_{0,0}, x_d0_{0,1}, x_d0_{1,0}, x_d0_{1,1}
Day 1: x_d1_{0,0}, x_d1_{0,1}, x_d1_{1,0}, x_d1_{1,1}
Day 2: x_d2_{0,0}, x_d2_{0,1}, x_d2_{1,0}, x_d2_{1,1}
Day 3: x_d3_{0,0}, x_d3_{0,1}, x_d3_{1,0}, x_d3_{1,1}
Output:
Day 0: norm_x_d0_{0,0} = nan, norm_x_d0_{0,1} = nan, norm_x_d0_{1,0} = nan, norm_x_d0_{1,1} = nan
Day 1: norm_x_d1_{0,0} = nan, norm_x_d1_{0,1} = nan, norm_x_d1_{1,0} = nan, norm_x_d1_{1,1} = nan
Day 2: norm_x_d2_{0,0} = nan, norm_x_d2_{0,1} = nan, norm_x_d2_{1,0} = nan, norm_x_d2_{1,1} = nan
Day 3: norm_x_d3_{0,0} = (x_d3_{0,0} - np.mean([x_d0_{0,0}, x_d0_{0,1}, x_d1_{0,0}, ..., x_d3_{0,1}])) / np.std([x_d0_{0,0}, x_d0_{0,1}, x_d1_{0,0}, ..., x_d3_{0,1}]), ...,
The key here is to use over to restrict your calculations to the five-minute bins and then use the rolling functions to get the rolling mean and standard deviation over days, restricted by those five-minute bin keys. five_minute_bin works as in your code, and I believe a truncated day_bin is necessary so that, for example, 9:33 on one day will include both 9:31 and 9:34 on the same day and 9:31 from 2 days ago.
from datetime import datetime
import polars as pl

days = 5

pl.DataFrame(
    {
        "timestamp": pl.concat(
            [
                pl.date_range(
                    datetime(2019, 12, d, 9, 30), datetime(2019, 12, d, 12, 30), "30s"
                )
                for d in range(2, days + 2)
            ]
        ),
        "visitors": [(e % 2) + 1 for e in range(days * 361)],
    }
).with_columns(
    five_minute_bin=pl.col("timestamp").dt.truncate(every="5m").dt.strftime("%H%M"),
    day_bin=pl.col("timestamp").dt.truncate(every="1d"),
).with_columns(
    standardized_visitors=(
        (
            pl.col("visitors")
            - pl.col("visitors").rolling_mean("3d", by="day_bin", closed="right")
        )
        / pl.col("visitors").rolling_std("3d", by="day_bin", closed="right")
    ).over("five_minute_bin")
)
Now, that said, when trying out the code for this, I found that polars doesn't handle non-unique values in the by column of the rolling functions correctly, so the same values in the same 5-minute bin don't end up as the same standardized values. I opened a bug report here: https://github.com/pola-rs/polars/issues/6691. For large amounts of real-world data this shouldn't actually matter much, unless your data systematically differs in distribution within the 5-minute bins.

Polars Dataframe: Apply MinMaxScaler to a column with condition

I am trying to perform the following operation in Polars.
Values in column B below 80 should be scaled between 1 and 4, whereas anything at or above 80 should be set to 5.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df_pandas = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "B": [50, 300, 80, 12, 105, 78, 66, 42, 61.5, 35],
    }
)
test_scaler = MinMaxScaler(feature_range=(1, 4))  # from sklearn.preprocessing
df_pandas.loc[df_pandas['B'] < 80, 'Test'] = test_scaler.fit_transform(
    df_pandas.loc[df_pandas['B'] < 80, "B"].values.reshape(-1, 1)
)
df_pandas = df_pandas.fillna(5)
This is what I did with Polars:
import numpy as np
import polars as pl

# dt is a dictionary
dt = df.filter(
    pl.col('B') < 80
).to_dict(as_series=False)
below_80 = list(dt.keys())
dt_scale = list(
    test_scaler.fit_transform(
        np.array(dt['B']).reshape(-1, 1)
    ).reshape(-1)  # reshape back to one dimensional
)
# reassign to dictionary dt
dt['B'] = dt_scale
dt_scale_df = pl.DataFrame(dt)
dt_scale_df

dummy = df.join(
    dt_scale_df, how="left", on="A"
).fill_null(5)
dummy = dummy.rename({"B_right": "Test"})
Result:
A    B      Test
1    50.0   2.727273
2    300.0  5.000000
3    80.0   5.000000
4    12.0   1.000000
5    105.0  5.000000
6    78.0   4.000000
7    66.0   3.454545
8    42.0   2.363636
9    61.5   3.250000
10   35.0   2.045455
Is there a better approach for this?
Alright, I have got 3 examples for you that should help, of which the last should be preferred.
Because you only want to apply your scaler to a part of a column, we should ensure we only send that part of the data to the scaler. This can be done by:
window function over a partition
partition_by
when -> then -> otherwise + min_max expression
Window function over partition
This requires a python function that will be applied over the partitions. In the function itself we then have to check in which partition we are and deal with it accordingly.
df = pl.from_pandas(df_pandas)
min_max_sc = MinMaxScaler((1, 4))

def my_scaler(s: pl.Series) -> pl.Series:
    if s.len() > 0 and s[0] > 80:
        out = (s * 0 + 5)
    else:
        out = pl.Series(min_max_sc.fit_transform(s.to_numpy().reshape(-1, 1)).flatten())
    # ensure all types are the same
    return out.cast(pl.Float64)

df.with_column(
    pl.col("B").apply(my_scaler).over(pl.col("B") < 80).alias("Test")
)
partition_by
This partitions the original dataframe into a dictionary holding the different partitions. We then only modify the partitions as needed.
parts = (df
    .with_column((pl.col("B") < 80).alias("part"))
    .partition_by("part", as_dict=True)
)

parts[True] = parts[True].with_column(
    pl.col("B").map(
        lambda s: pl.Series(min_max_sc.fit_transform(s.to_numpy().reshape(-1, 1)).flatten())
    ).alias("Test")
)

parts[False] = parts[False].with_column(
    pl.lit(5.0).alias("Test")
)

pl.concat([df for df in parts.values()]).select(pl.all().exclude("part"))
when -> then -> otherwise + min_max expression
This one I like best. We can make a function that creates a polars expression that is the min_max scaling function you need. This will have the best performance.
def min_max_scaler(col: str, predicate: pl.Expr):
    x = pl.col(col)
    x_min = x.filter(predicate).min()
    x_max = x.filter(predicate).max()
    # * 3 + 1 to set scale between 1 - 4
    return (x - x_min) / (x_max - x_min) * 3 + 1

predicate = pl.col("B") < 80

df.with_column(
    pl.when(predicate)
    .then(min_max_scaler("B", predicate))
    .otherwise(5).alias("Test")
)
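As a side note, the * 3 + 1 hard-codes the (1, 4) range; a small sketch of my own (not part of the original answer) generalizes the same expression to an arbitrary (low, high) range:

def min_max_scaler_range(col: str, predicate: pl.Expr, low: float, high: float):
    # scale x into [low, high], using only the rows matching predicate for min/max
    x = pl.col(col)
    x_min = x.filter(predicate).min()
    x_max = x.filter(predicate).max()
    return (x - x_min) / (x_max - x_min) * (high - low) + low

df.with_column(
    pl.when(pl.col("B") < 80)
    .then(min_max_scaler_range("B", pl.col("B") < 80, 1, 4))
    .otherwise(5).alias("Test")
)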

Create LineString from Lat/Lon columns using PySpark

I have a PySpark dataframe containing Lat/Lon points for different trajectories, identified by a column "trajectory_id".
trajectory_id    latitude    longitude
1                45          5
1                45          6
1                45          7
2                46          5
2                46          6
2                46          7
What I want to do is to extract for each trajectory_id a LineString and store it in another dataframe, where each row represents a trajectory with "id" and "geometry" columns. In this example, the output should be:
trajectory_id    geometry
1                LINESTRING (5 45, 6 45, 7 45)
2                LINESTRING (5 46, 6 46, 7 46)
This is similar to what has been asked in this question, but in my case I need to use PySpark.
I have tried the following:
import pandas as pd
import pyspark.sql.functions as F
from shapely.geometry import Point, LineString

df = pd.DataFrame([[1, 45, 5], [1, 45, 6], [1, 45, 7], [2, 46, 5], [2, 46, 6], [2, 46, 7]],
                  columns=['trajectory_id', 'latitude', 'longitude'])
df1 = spark.createDataFrame(df)

idx_ = df1.select("trajectory_id").rdd.flatMap(lambda x: x).distinct().collect()
geo_df = pd.DataFrame(index=range(len(idx_)), columns=['geometry', 'trajectory_id'])

k = 0
for i in idx_:
    df2 = df1.filter(F.col("trajectory_id").isin(i)).toPandas()
    df2['points'] = df2[["longitude", "latitude"]].apply(Point, axis=1)
    geo_df.geometry.iloc[k] = str(LineString(df2['points']))
    geo_df['trajectory_id'].iloc[k] = i
    k = k + 1
This code works, but as in my task I am working with many more trajectories (> 2 million), this takes forever since I am converting to Pandas in each iteration.
Is there a way I can obtain the same output in a more efficient way?
As mentioned, I know that using toPandas() (and/or collect()) is something I should avoid, especially inside a for loop.
You can do this by using pyspark SQL's native functions.
import pyspark.sql.functions as func

long_lat_df = df.withColumn('joined_long_lat', func.concat(func.col("longitude"), func.lit(" "), func.col("latitude")))
grouped_df = long_lat_df.groupby('trajectory_id').agg(func.collect_list('joined_long_lat').alias("geometry"))
final_df = grouped_df.withColumn('geometry', func.concat_ws(", ", func.col("geometry")))
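If the full WKT text from the expected output is needed (the LINESTRING (...) wrapper), a small follow-up sketch of my own, building on final_df above, concatenates the literal prefix and suffix. Note also that collect_list does not guarantee point order after a groupBy, so in practice a sort key may be needed.

# hypothetical follow-up: wrap the comma-separated coordinate pairs into WKT text
wkt_df = final_df.withColumn(
    'geometry',
    func.concat(func.lit("LINESTRING ("), func.col("geometry"), func.lit(")"))
)
wkt_df.show(truncate=False)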

pyspark randomForest feature importance: how to get column names from the column numbers

I am using the standard (string indexer + one hot encoder + randomForest) pipeline in spark, as shown below
labelIndexer = StringIndexer(inputCol=class_label_name, outputCol="indexedLabel").fit(data)

string_feature_indexers = [
    StringIndexer(inputCol=x, outputCol="int_{0}".format(x)).fit(data)
    for x in char_col_toUse_names
]
onehot_encoder = [
    OneHotEncoder(inputCol="int_" + x, outputCol="onehot_{0}".format(x))
    for x in char_col_toUse_names
]
all_columns = num_col_toUse_names + bool_col_toUse_names + ["onehot_" + x for x in char_col_toUse_names]
assembler = VectorAssembler(inputCols=[col for col in all_columns], outputCol="features")

rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features", numTrees=100)
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)

pipeline = Pipeline(stages=[labelIndexer] + string_feature_indexers + onehot_encoder + [assembler, rf, labelConverter])

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)
cvModel = crossval.fit(trainingData)
Now, after the fit, I can get the random forest and the feature importances using cvModel.bestModel.stages[-2].featureImportances, but this does not give me feature/column names, rather just the feature numbers.
What I get is below:
print(cvModel.bestModel.stages[-2].featureImportances)
(1446,[3,4,9,18,20,103,766,981,983,1098,1121,1134,1148,1227,1288,1345,1436,1444],[0.109898803421,0.0967396441648,4.24568235244e-05,0.0369705839109,0.0163489685127,3.2286694534e-06,0.0208192703688,0.0815822887175,0.0466903663708,0.0227619959989,0.0850922269211,0.000113388896956,0.0924779490403,0.163835022713,0.118987129392,0.107373548367,3.35577640585e-05,0.000229569946193])
How can I map it back to column names, or to a column name + value format?
Basically, I want to get the feature importances of the random forest along with the column names.
The transformed dataset's metadata has the required attributes. Here is an easy way to do it:
Create a pandas dataframe (generally the feature list will not be huge, so there are no memory issues in storing a pandas DF):
pandasDF = pd.DataFrame(
    dataset.schema["features"].metadata["ml_attr"]["attrs"]["binary"]
    + dataset.schema["features"].metadata["ml_attr"]["attrs"]["numeric"]
).sort_values("idx")
Then create a broadcast dictionary to do the mapping; broadcasting is necessary in a distributed environment.
feature_dict = dict(zip(pandasDF["idx"],pandasDF["name"]))
feature_dict_broad = sc.broadcast(feature_dict)
You can also look here and here
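For completeness, a small sketch of my own (not from the original answer) showing how the importances vector printed above, which comes back as a SparseVector, could be paired with the recovered names using pandasDF and cvModel from above:

# hypothetical usage: pair each non-zero importance with its feature name
importances = cvModel.bestModel.stages[-2].featureImportances  # SparseVector
name_by_idx = dict(zip(pandasDF["idx"], pandasDF["name"]))

named_importances = sorted(
    ((name_by_idx[i], float(importances[i])) for i in importances.indices),
    key=lambda kv: kv[1],
    reverse=True,
)
print(named_importances[:10])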
Why don't you just map it back to the original columns through list expansion? Here is an example:
# in your case: trainingData.columns
data_frame_columns = ["A", "B", "C", "D", "E", "F"]
# in your case: print(cvModel.bestModel.stages[-2].featureImportances)
feature_importance = (1, [1, 3, 5], [0.5, 0.5, 0.5])
rf_output = [(data_frame_columns[i], feature_importance[2][j]) for i, j in zip(feature_importance[1], range(len(feature_importance[2])))]
dict(rf_output)
{'B': 0.5, 'D': 0.5, 'F': 0.5}
I was not able to find any way to get the true initial list of columns back after the ML algorithm; I am using this as the current workaround.
print(len(cols_now))

FEATURE_COLS = []
for x in cols_now:
    if x[-6:] != "catVar":
        FEATURE_COLS += [x]
    else:
        temp = trainingData.select([x[:-7], x[:-6] + "tmp"]).distinct().sort(x[:-6] + "tmp")
        temp_list = temp.select(x[:-7]).collect()
        FEATURE_COLS += [list(x)[0] for x in temp_list]

print(len(FEATURE_COLS))
print(FEATURE_COLS)
I have kept consistent suffix naming across all the indexers (_tmp) & encoders (_catVar), like:
column_vec_in = str_col
column_vec_out = [col + "_catVar" for col in str_col]

indexers = [StringIndexer(inputCol=x, outputCol=x + '_tmp')
            for x in column_vec_in]
encoders = [OneHotEncoder(dropLast=False, inputCol=x + "_tmp", outputCol=y)
            for x, y in zip(column_vec_in, column_vec_out)]

tmp = [[i, j] for i, j in zip(indexers, encoders)]
tmp = [i for sublist in tmp for i in sublist]
This can be further improved and generalized, but currently this tedious workaround works best.

PySpark : how to split data without randomnize

There is a function that can randomly split data:
trainingRDD, validationRDD, testRDD = RDD.randomSplit([6, 2, 2], seed=0)
I'm curious if there is a way to generate the same partitions (train 60 / valid 20 / test 20) but without randomizing (let's just say use the current data order: the first 60% = train, the next 20% = valid, and the last 20% are test data).
Is there a way to split the data similarly, but without randomizing?
The basic issue here is that unless you have an index column in your data, there is no concept of "first rows" and "next rows" in your RDD; it's just an unordered set. If you have an integer index column you could do something like this:
train = RDD.filter(lambda r: r['index'] % 5 <= 2)       # indices 0, 1, 2 -> 60%
validation = RDD.filter(lambda r: r['index'] % 5 == 3)  # index 3 -> 20%
test = RDD.filter(lambda r: r['index'] % 5 == 4)        # index 4 -> 20%
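If no index column exists yet, one could be attached first; a minimal sketch of my own, assuming dict-like records as in the snippet above, using zipWithIndex:

# hypothetical setup: attach a positional index to each record
indexed = RDD.zipWithIndex().map(lambda pair: {**pair[0], 'index': pair[1]})

train = indexed.filter(lambda r: r['index'] % 5 <= 2)
validation = indexed.filter(lambda r: r['index'] % 5 == 3)
test = indexed.filter(lambda r: r['index'] % 5 == 4)

Note that this modulo split is interleaved rather than contiguous; for a truly contiguous 60/20/20 split (first 60%, next 20%, last 20%), the index could instead be compared against thresholds derived from indexed.count().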