The goal is to sum the value in the 'points' column if the player's name begins with 'D' and they are younger than 20.
name     age  points
Diego    31   1
Giorgio  27   4
Pat      30   7
Doug     15   7
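For reference, a minimal sketch that builds this example dataframe (assuming a SparkSession; column names taken from the table above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Diego", 31, 1), ("Giorgio", 27, 4), ("Pat", 30, 7), ("Doug", 15, 7)],
    ["name", "age", "points"],
)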
I've tried the following, but I keep getting errors:
import pyspark.sql.functions as F
point_sum = df.agg(F.when(F.col('age') < 20), F.sum('points')).collect()[0][0]
You would want to filter the data before doing the aggregation.
(df
    .where(F.col('name').startswith('D') & (F.col('age') < 20))
    .groupBy(F.lit(1))
    .agg(F.sum('points').alias('total_points'))
    .collect()[0][1]
)
# Output: 7
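As a side note, a slightly simpler sketch of the same idea (using the same F alias): agg() on an ungrouped dataframe already returns a single row, so the groupBy(F.lit(1)) step can be dropped.

point_sum = (df
    .where(F.col('name').startswith('D') & (F.col('age') < 20))
    .agg(F.sum('points'))
    .collect()[0][0])
# point_sum == 7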
Good try, keep trying. One alternative is to use filter with substring.
from pyspark.sql.functions import sum as sum_

# Filter on both age < 20 and name starting with 'D' (substring via the SQL
# expression string), then sum the points.
point_sum = (df.filter("age < 20 and substring(name, 1, 1) == 'D'")
               .select('points')
               .agg(sum_('points'))
               .collect()[0][0])
print(point_sum)
Output:
7
Background
I use explode to transpose columns to rows.
This works very well in general with good performance.
The source dataframe (df_audit in the code below) is dynamic, so it can contain different structures.
Problem
Recently an incoming dataframe had a very large number of columns (5 thousand). The code below runs successfully, but the line starting with 'exploded' is very slow to run.
Has anyone faced similar problems? I could split the dataframe into multiple dataframes (broken out by columns), or might there be a better way? Or example code?
Example code
from pyspark.sql.functions import array, col, explode, lit, struct

key_cols = ["cola", "colb", "colc"]
cols = [c for c in df_audit.columns if c not in key_cols]
exploded = explode(array([struct(lit(c).alias("key"), col(c).alias("val")) for c in cols])).alias("exploded")
df_audit = df_audit.select(key_cols + [exploded]).select(key_cols + ["exploded.key", "exploded.val"])
Both lit() and col() are for some reason quite slow when used in a loop. You can try arrays_zip() instead:
exploded = explode(
    arrays_zip(split(lit(','.join(cols)), ',').alias('key'), array(cols).alias('val'))
).alias('exploded')
In my quick test on 5k columns, this runs in ~6 s vs. ~25 s for the original.
Sharing some timings for bzu's approach and the OP's approach, based on a Colaboratory notebook.
import pyspark.sql.functions as func

cols = ['i'+str(i) for i in range(5000)]
# OP's method
%timeit func.array(*[func.struct(func.lit(k).alias('k'), func.col(k).alias('v')) for k in cols])
# 34.7 s ± 2.84 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
# bzu's method
%timeit func.arrays_zip(func.split(func.lit(','.join(cols)), ',').alias('k'), func.array(cols).alias('v'))
# 10.7 s ± 1.41 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thank you bzu & samkart, but for some reason I cannot get the new line working.
I have created a simple example below that doesn't work, in case you can see something obvious I am missing.
from pyspark.sql.functions import (
    array, arrays_zip, coalesce, col, explode, lit, lower, split, struct, substring,
)
from pyspark.sql.types import StringType

def process_data():
    try:
        logger.info("\ntest 1")
        df_audit = spark.createDataFrame(
            [("1", "foo", "abc", "xyz"), ("2", "bar", "def", "zab")],
            ["id", "label", "colx", "coly"],
        )
        logger.info("\ntest 2")
        key_cols = ["id", "label"]
        cols = [c for c in df_audit.columns if c not in key_cols]
        logger.info("\ntest 3")
        # exploded = explode(array([struct(lit(c).alias("key"), col(c).alias("val")) for c in cols])).alias("exploded")
        exploded = explode(arrays_zip(split(lit(','.join(cols)), ',').alias('key'), array(cols).alias('val'))).alias('exploded')
        logger.info("\ntest 4")
        df_audit = df_audit.select(key_cols + [exploded]).select(key_cols + ["exploded.key", "exploded.val"])
        df_audit.show()
    except Exception as e:
        logger.error("Error in process_audit_data: {}".format(e))
        return False
    return True
When I call the process_data function, I get the following logged:
test 1
test 2
test 3
test 4
Error in process_audit_data: No such struct field key in 0, 1.
Note: it does work successfully with the commented-out exploded line.
Many thanks
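Not a definitive fix, but the error message ('No such struct field key in 0, 1') suggests that in this Spark version arrays_zip names the zipped struct fields '0' and '1' rather than honouring the aliases. A hedged workaround sketch, keeping the same exploded expression and simply selecting the fields by those positional names:

df_audit = (df_audit
    .select(key_cols + [exploded])
    .select(key_cols + [col("exploded").getField("0").alias("key"),
                        col("exploded").getField("1").alias("val")]))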
I have been learning PySpark recently and wanted to apply it to one of my problems. Basically, I want to perform random trials on each record in a dataframe. My dataframe is structured as below.
order_id,order_date,distribution,quantity
O1,D1,3 4 4 5 6 7 8 ... ,10
O2,D2,1 6 9 10 12 16 18 ..., 20
O3,D3,7 12 15 16 18 20 ... ,50
Here the distribution column holds 100 percentile points, with each value space separated.
I want to loop through each of these rows in the dataframe, randomly select a point in the distribution, add that many days to order_date, and create a new column arrival_date.
At the end I want to get the avg(quantity) by arrival_date, so my final dataframe should look like:
arrival_date,qty
A1,5
A2,10
What I have achieved so far is below:
import datetime
import random

df = spark.read.option("header", True).csv("/tmp/test.csv")

def randSample(row):
    order_id = row.order_id
    quantity = int(row.quantity)
    data = []
    for i in range(1, 20):
        # Pick a random percentile point from the space-separated distribution.
        n = random.randint(0, 99)
        randnum = int(float(row.distribution.split(" ")[n]))
        arrival_date = datetime.datetime.strptime(row.order_date.split(" ")[0], "%Y-%m-%d") + datetime.timedelta(days=randnum)
        data.append((arrival_date, quantity))
    return data

finalRDD = df.rdd.map(randSample)
The calculations look correct; however, the finalRDD is structured as a list of lists, as below:
[
[(),(),(),()]
,[(),(),(),()]
,[(),(),(),()]
,[(),(),(),()]
]
Each list inside the main list is a single record, and each tuple inside a nested list is one trial of that record.
Basically, I want the final output as flattened records so that I can compute the average:
[
(),
(),
(),
]
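A minimal sketch of one way to get there, assuming the randSample function above: use flatMap instead of map, so each record's list of trials is flattened into individual rows, then convert back to a dataframe and average quantity per arrival date.

import pyspark.sql.functions as F

# flatMap (instead of map) flattens each record's list of trials into
# individual (arrival_date, quantity) tuples.
finalRDD = df.rdd.flatMap(randSample)

# Back to a dataframe, then average the quantity per arrival date.
result = (finalRDD
    .toDF(["arrival_date", "quantity"])
    .groupBy("arrival_date")
    .agg(F.avg("quantity").alias("qty")))
result.show()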
I have a PySpark dataframe with the below values:
[Row(id='ABCD123', score='28.095238095238095'), Row(id='EDFG456', score='36.2962962962963'), Row(id='HIJK789', score='37.56218905472637'), Row(id='LMNO1011', score='36.82352941176471')]
I want only the rows from the DF whose score falls between the input score value and the input score value + 1. Say the input score value is 36; then I want the output DF with only two ids - EDFG456 & LMNO1011 - as their scores fall between 36 and 37. I achieved this as follows:
from pyspark.sql.functions import substring

input_score_value = 36
input_df = my_df.withColumn("score_num", substring(my_df.score, 1, 2))
output_matched = input_df.filter(input_df.score_num == input_score_value)
print(output_matched.take(5))
The above code gives the output below, but it takes too long to process 2 million rows. I was wondering if there is a better way to do this to reduce the response time.
[Row(id='EDFG456', score='36.2962962962963'), Row(id='LMNO1011',score='36.82352941176471')]
You can use the function floor.
from pyspark.sql.functions import floor

# floor() on the score itself keeps rows in [input_score_value, input_score_value + 1).
output_matched = my_df.filter(floor(my_df.score.cast("double")) == input_score_value)
print(output_matched.take(5))
It should be much faster compared to substring. Let me know.
I have two columns that represent 'TeamName' and 'MatchResult', for example:
ManCity L
Liverpool D
Arsenal W
I'm trying to create a third column, 'Points', based on the match results of different football teams: 3 points for a Win, 1 for a Draw, 0 for a Loss.
I've tried .withColumn with when and if, but can't get the syntax right.
Thanks a lot in advance for your time. The expected output:
ManCity L 0
Liverpool D 1
Arsenal W 3
You can use:
from pyspark.sql.functions import when, col
df = df.withColumn("points", when(col("MatchResult") == "W", 3).when(col("MatchResult") == "D", 1).otherwise(0))
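For reference, a self-contained sketch (column names assumed from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("ManCity", "L"), ("Liverpool", "D"), ("Arsenal", "W")],
    ["TeamName", "MatchResult"],
)
df = df.withColumn(
    "points",
    when(col("MatchResult") == "W", 3)
    .when(col("MatchResult") == "D", 1)
    .otherwise(0),
)
df.show()
# ManCity -> 0, Liverpool -> 1, Arsenal -> 3, as in the expected output above.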
So, presume a matrix like so:
20 2
20 2
30 2
30 1
40 1
40 1
I want to count the number of times 1 occurs for each unique value of column 1. I could do this the long way with [sum(x(1:2,2)==1)] for each value, but I think this would be the perfect use for the UNIQUE function. How could I do it so that I get an output like this:
20 0
30 1
40 2
Sorry if the solution seems obvious, my grasp of loops is very poor.
Indeed unique is a good option:
u = unique(x(:,1))
res = arrayfun(@(y) length(x(x(:,1)==y & x(:,2)==1)), u)
Taking apart that last line:
arrayfun(fun, array) applies fun to each element in the array and puts the results in a new array, which it returns.
Here fun is the anonymous function @(y) length(x(x(:,1)==y & x(:,2)==1)), which finds the length of the portion of x where the condition x(:,1)==y & x(:,2)==1 holds (called logical indexing). So for each of the unique elements, it counts the rows of x where the first column is that unique element and the second column is one.
Try this (as specified in this answer):
>> [c,~,d] = unique(a(a(:,2)==1))
c =
    30
    40
d =
     1
     2
     2
>> counts = accumarray(d(:), 1, [], @sum)
counts =
     1
     2
>> res = [c, counts]
Consider that you have an array of various integers in 'array'.
The tabulate function will sort the unique values and count the occurrences:
table = tabulate(array)
Look for your counts in column 2 of table.