pyspark explode performance - pyspark

Background
I use explode to transpose columns to rows.
This works very well in general with good performance.
The source dataframe (df_audit in below code) is dynamic so can contain different structure.
Problem
Recently have incoming dataframe with very large number of columns (5 thousand) - the below code runs successfully but is very slow to run the line starting 'exploded'.
Anyone faced similar problems? I could split up the dataframe to multiple dataframes (broken out by columns) or might there be better way? Or example code?
Example code
key_cols = ["cola", "colb", "colc"]
cols = [col for col in df_audit.columns if col not in key_cols]
exploded = explode(array([struct(lit(c).alias("key"), col(c).alias("val")) for c in cols])).alias("exploded")
df_audit = df_audit.select(key_cols + [exploded]).select(key_cols + ["exploded.key", "exploded.val"])

Both lit() and col() are for some reason quite slow when used in a loop. You can try instead with arrays_zip():
exploded = explode(
arrays_zip(split(lit(','.join(cols)), ',').alias('key'), array(cols).alias('val'))
).alias('exploded')
In my quick test on 5k columns, this runs for ~6s vs. original ~25s.

Sharing some timings for bzu's approach and OP's approach based on colaboratory notebook.
cols = ['i'+str(i) for i in range(5000)]
# OP's method
%timeit func.array(*[func.struct(func.lit(k).alias('k'), func.col(k).alias('v')) for k in cols])
# 34.7 s ± 2.84 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
# bzu's method
%timeit func.arrays_zip(func.split(func.lit(','.join(cols)), ',').alias('k'), func.array(cols).alias('v'))
# 10.7 s ± 1.41 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Thank you bzu & samkart but for some reason I cannot get the new line working.
I have created a simple example that doesn't work as follows if you can see something obvious I am missing.
from pyspark.sql.functions import (
array, arrays_zip, coalesce, col, explode, lit, lower, split, struct,substring,)
from pyspark.sql.types import StringType
def process_data():
try:
logger.info("\ntest 1")
df_audit = spark.createDataFrame([("1", "foo", "abc", "xyz"),("2", "bar", "def", "zab"),],["id", "label", "colx", "coly"])
logger.info("\ntest 2")
key_cols = ["id", "label"]
cols = [col for col in df_audit.columns if col not in key_cols]
logger.info("\ntest 3")
# exploded = explode(array([struct(lit(c).alias("key"), col(c).alias("val")) for c in cols])).alias("exploded")
exploded = explode(arrays_zip(split(lit(','.join(cols)), ',').alias('key'), array(cols).alias('val'))).alias('exploded')
logger.info("\ntest 4")
df_audit = df_audit.select(key_cols + [exploded]).select(key_cols + ["exploded.key", "exploded.val"])
df_audit.show()
except Exception as e:
logger.error("Error in process_audit_data: {}".format(e))
return False
return True
When I call process_data function I get following logged:
test 1
test 2
test 3
test 4
Error in process_audit_data: No such struct field key in 0, 1.
Note: it does work successfully with the commented exploded line
Many thanks

Related

Cast a list column into dummy columns in Python Polars?

I have a very large data frame where there is a column that is a list of numbers representing category membership.
Here is a dummy version
import pandas as pd
import numpy as np
segments = [str(i) for i in range(1_000)]
# My real data is ~500m rows
nums = np.random.choice(segments, (100_000,10))
df = pd.DataFrame({'segments': [','.join(n) for n in nums]})
userId
segments
0
885,106,49,138,295,254,26,460,0,844
1
908,709,454,966,151,922,666,886,65,708
2
664,713,272,241,301,498,630,834,702,289
3
60,880,906,471,437,383,878,369,556,876
4
817,183,365,171,23,484,934,476,273,230
...
...
Note that there is a known list of segments (0-999 in the example)
I want to cast this into dummy columns indicating membership to each segment.
I found a few ways of doing this:
In pandas:
df_one_hot_encoded = (df['segments']
.str.split(',')
.explode()
.reset_index()
.assign(__one__=1)
.pivot_table(index='index', columns='segments', values='__one__', fill_value=0)
)
(takes 8 seconds on a 100k row sample)
And polars
df2 = pl.from_pandas(df[['segments']])
df_ans = (df2
.with_columns([
pl.arange(0, len(df2)).alias('row_index'),
pl.col('segments').str.split(','),
pl.lit(1).alias('__one__')
])
.explode('segments')
.pivot(index='row_index', columns='segments', values='__one__')
.fill_null(0)
)
df_one_hot_encoded = df_ans.to_pandas()
(takes 1.5 seconds inclusive of the conversion to and from pandas, .9s without)
However, I hear .pivot is not efficient, and that it does not work well with lazy frames.
I tried other solutions in polars, but they were much slower:
_ = df2.lazy().with_columns(**{segment: pl.col('segments').str.contains(segment) for segment in segments}).collect()
(2 seconds)
(df2
.with_columns([
pl.arange(0, len(df2)).alias('row_index'),
pl.col('segments').str.split(',')
])
.explode('segments')
.to_dummies(columns=['segments'])
.groupby('row_index')
.sum()
)
(4 seconds)
Does anyone know a better solution than the .9s pivot?
This approach ends up being slower than the pivot but it's a got a different trick so I'll include it.
df2=pl.from_pandas(df)
df2_ans=(df2.with_row_count('userId').with_column(pl.col('segments').str.split(',')).explode('segments') \
.with_columns([pl.when(pl.col('segments')==pl.lit(str(i))).then(pl.lit(1,pl.Int32)).otherwise(pl.lit(0,pl.Int32)).alias(str(i)) for i in range(1000)]) \
.groupby('userId')).agg(pl.exclude('segments').sum())
df_one_hot_encoded = df2_ans.to_pandas()
A couple of other observations. I'm not sure if you checked the output of your str.contains method but I would think that wouldn't work because, for example, 15 is contained within 154 when looking at strings.
The other thing, which I guess is just a preference, is the with_row_count syntax vs the pl.arrange. I don't think the performance of either is better (at least not significantly so) but you don't have to reference the df name to get the len of it which is nice.
I tried a couple other things that were also worse including not doing the explode and just doing is_in but that was slower. I tried using bools instead of 1s and 0s and then aggregating with any but that was slower.

Is it semantically possible to optimize LazyFrame -> Fill Null -> Cast to Categorical?

Here is a trivial benchmark based on a real-life workload.
import gc
import time
import numpy as np
import polars as pl
df = ( # I have a dataframe like this from reading a csv.
pl.Series(
name="x",
values=np.random.choice(
["ASPARAGUS", "BROCCOLI", ""], size=30_000_000
),
)
.to_frame()
.with_column(
pl.when(pl.col("x") == "").then(None).otherwise(pl.col("x"))
)
)
start = time.time()
df.lazy().with_column(
pl.col("x").cast(pl.Categorical).fill_null("MISSING")
).collect()
end = time.time()
print(f"Cast then fill_null took {end-start:.2f} seconds.")
Cast then fill_null took 0.93 seconds.
gc.collect()
start = time.time()
df.lazy().with_column(
pl.col("x").fill_null("MISSING").cast(pl.Categorical)
).collect()
end = time.time()
print(f"Fill_null then cast took {end-start:.2f} seconds.")
Fill_null then cast took 1.36 seconds.
(1) Am I correct to think that casting to categorical then filling null will always be faster?
(2) Am I correct to think that the result will always be identical regardless of the order?
(3) If the answers are "yes" and "yes", is it possible that someday polars will do this rearrangement automatically? Or is it actually impossible try all these sorts of permutations in a general query optimizer?
Thanks.
1: yes
2: somewhat. The logical categorcal representatition will always be the same. The physical changes by the order of occurrence of the string values. Doing fill_null before the cast, means "MISSING" will be found earlier. But this should be seen as an implementation detail.
3: Yes, this is something we can automatically optimize. Just today we merged something similar: https://github.com/pola-rs/polars/pull/4883

Filters inside for loop in pyspark is really slow

I am reading data from s3 bucket and run a for loop and do few filers and find max value. Then running on emr cluster But this is taking hours to run.
df has 1.5 m rows and df_new has 50000 rows. The reason why I converted to np array to see whether it improves the performance for the loop.
Since i am new to pyspark i am not sure whether its a efficient way to do this or a better way to do this.
Thanks in advance
df = spark.read.format('parquet').load(os.path.join('s3://', bucket_name, bucket_path_exec+date_val, report_name)
df_new = df.filter(f.col("a") == 1)
df_new = np.array(trade_report_broker.select("a", "b","c", "d","e").collect())
rows = len(df_new)
for i in range(0,rows):
aaa = df_newr[i][0]
eee = df_new[i][4]
time = df_new[i][2]
sub = df_new.filter(f.col("a") == aaa)
sub = sub.filter(f.col("b") < time)
max_time = sub.groupby().agg(f.max("eee").alias("MaxTime"))

PySpark approxSimilarityJoin() not returning any results

I am trying to find similar users by vectorizing user features and sorting by distance between user vectors in PySpark. I'm running this in Databricks on Runtime 5.5 LTS ML cluster (Scala 2.11, Spark 2.4.3)
Following the code in the docs, I am using approxSimilarityJoin() method from the pyspark.ml.feature.BucketedRandomProjectionLSH model.
I have found similar users successfully using approxSimilarityJoin(), but every now and then I come across a user of interest that apparently has no users similar to them.
Usually when approxSimilarityJoin() doesn't return anything, I assume it's because the threshold parameter is set to low. That fixes the issue sometimes, but now I've tried using a threshold of 100000 and still getting nothing back.
I define the model as
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=1.0)
I'm not sure if I changing bucketLength or numHashTables would help in obtaining results.
The following example shows a pair of users where approxSimilarityJoin() returned something (dataA, dataB) and a pair of users (dataC, dataD) where it didn't.
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col
dataA = [(0, Vectors.dense([0.7016968702094931,0.2636417660310031,4.155293362824633,4.191398632883099]),)]
dataB = [(1, Vectors.dense([0.3757117100334294,0.2636417660310031,4.1539923630906745,4.190086328785612]),)]
dfA = spark.createDataFrame(dataA, ["customer_id", "scaledFeatures"])
dfB = spark.createDataFrame(dataB, ["customer_id", "scaledFeatures"])
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=2.0,
numHashTables=3)
model = brp.fit(dfA)
# returns
# theshold of 100000 is clearly overkill
# A dataframe with dfA and dfB feature vectors and a EuclideanDistance of 0.32599039770730354
model.approxSimilarityJoin(dfA, dfB, 100000, distCol="EuclideanDistance").show()
dataC = [(0, Vectors.dense([1.1600056435954367,78.27652460873155,3.5535837780801396,0.0030949620591871887]),)]
dataD = [(1, Vectors.dense([0.4660731192450482,39.85571715054726,1.0679201943112886,0.012330725745062067]),)]
dfC = spark.createDataFrame(dataC, ["customer_id", "scaledFeatures"])
dfD = spark.createDataFrame(dataD, ["customer_id", "scaledFeatures"])
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=2.0,
numHashTables=3)
model = brp.fit(dfC)
# returns empty df
model.approxSimilarityJoin(dfC, dfD, 100000, distCol="EuclideanDistance").show()
I was able to obtain results to the second half of the example above by increasing the bucketLength parameter value to 15. The threshold could have been lowered because the Euclidean Distance was ~34.
Per the PySpark docs:
bucketLength = the length of each hash bucket, a larger bucket lowers the false negative rate

Iterate through mixed-Type scala Lists

Using Spark 2.1.1., I have an N-row csv as 'fileInput'
colname datatype elems start end
colA float 10 0 1
colB int 10 0 9
I have successfully made an array of sql.rows ...
val df = spark.read.format("com.databricks.spark.csv").option("header", "true").load(fileInput)
val rowCnt:Int = df.count.toInt
val aryToUse = df.take(rowCnt)
Array[org.apache.spark.sql.Row] = Array([colA,float,10,0,1], [colB,int,10,0,9])
Against those Rows and using my random-value-generator scripts, I have successfully populated an empty ListBuffer[Any] ...
res170: scala.collection.mutable.ListBuffer[Any] = ListBuffer(List(0.24455154, 0.108798146, 0.111522496, 0.44311434, 0.13506883, 0.0655781, 0.8273762, 0.49718297, 0.5322746, 0.8416396), List(1, 9, 3, 4, 2, 3, 8, 7, 4, 6))
Now, I have a mixed-type ListBuffer[Any] with different typed lists.
.
How do iterate through and zip these? [Any] seems to defy mapping/zipping. I need to take N lists generated by the inputFile's definitions, then save them to a csv file. Final output should be:
ColA, ColB
0.24455154, 1
0.108798146, 9
0.111522496, 3
... etc
The inputFile can then be used to create any number of 'colnames's, of any 'datatype' (I have scripts for that), of each type appearing 1::n times, of any number of rows (defined as 'elems'). My random-generating scripts customize the values per 'start' & 'end', but these columns are not relevant for this question).
Given a List[List[Any]], you can "zip" all these lists together using transpose, if you don't mind the result being a list-of-lists instead of a list of Tuples:
val result: Seq[List[Any]] = list.transpose
If you then want to write this into a CSV, you can start by mapping each "row" into a comma-separated String:
val rows: Seq[String] = result.map(_.mkString(","))
(note: I'm ignoring the Apache Spark part, which seems completely irrelevant to this question... the "metadata" is loaded via Spark, but then it's collected into an Array so it becomes irrelevant)
I think the RDD.zipWithUniqueId() or RDD.zipWithIndex() methods can perform what you wanna do.
Please refer to official documentation for more information. hope this help you