The goal is to sum the value in the 'points' column if the player's name begins with 'D' and they are younger than 20.
name     age  points
Diego    31   1
Giorgio  27   4
Pat      30   7
Doug     15   7
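For reference, a minimal sketch that builds this example dataframe (assuming a SparkSession; column names taken from the table above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Diego", 31, 1), ("Giorgio", 27, 4), ("Pat", 30, 7), ("Doug", 15, 7)],
    ["name", "age", "points"],
)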
I've tried the following, but I keep getting errors:
import pyspark.sql.functions as F
point_sum = df.agg(F.when(F.col('age') < 20), F.sum('points')).collect()[0][0]
You would want to filter the data before doing the aggregation.
(df
    .where(F.col('name').startswith('D') & (F.col('age') < 20))
    .groupBy(F.lit(1))
    .agg(F.sum('points').alias('total_points'))
    .collect()[0][1]
)
# Output: 7
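As a side note, a slightly simpler sketch of the same idea (using the same F alias): agg() on an ungrouped dataframe already returns a single row, so the groupBy(F.lit(1)) step can be dropped.

point_sum = (df
    .where(F.col('name').startswith('D') & (F.col('age') < 20))
    .agg(F.sum('points'))
    .collect()[0][0])
# point_sum == 7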
Good try, keep trying. One alternative is to use filter with substring.
from pyspark.sql.functions import sum as sum_

# Filter on both age < 20 and name starting with 'D' (substring via the SQL
# expression string), then sum the points.
point_sum = (df.filter("age < 20 and substring(name, 1, 1) == 'D'")
               .select('points')
               .agg(sum_('points'))
               .collect()[0][0])
print(point_sum)
Output:
7
Background
I use explode to transpose columns to rows.
This works very well in general with good performance.
The source dataframe (df_audit in the code below) is dynamic, so it can contain different structures.
Problem
Recently an incoming dataframe had a very large number of columns (5 thousand). The code below runs successfully, but the line starting with 'exploded' is very slow to run.
Has anyone faced similar problems? I could split the dataframe into multiple dataframes (broken out by columns), or might there be a better way? Or example code?
Example code
from pyspark.sql.functions import array, col, explode, lit, struct

key_cols = ["cola", "colb", "colc"]
cols = [c for c in df_audit.columns if c not in key_cols]
exploded = explode(array([struct(lit(c).alias("key"), col(c).alias("val")) for c in cols])).alias("exploded")
df_audit = df_audit.select(key_cols + [exploded]).select(key_cols + ["exploded.key", "exploded.val"])
Both lit() and col() are for some reason quite slow when used in a loop. You can try arrays_zip() instead:
exploded = explode(
    arrays_zip(split(lit(','.join(cols)), ',').alias('key'), array(cols).alias('val'))
).alias('exploded')
In my quick test on 5k columns, this runs in ~6 s vs. ~25 s for the original.
Sharing some timings for bzu's approach and the OP's approach, based on a Colaboratory notebook.
import pyspark.sql.functions as func

cols = ['i'+str(i) for i in range(5000)]
# OP's method
%timeit func.array(*[func.struct(func.lit(k).alias('k'), func.col(k).alias('v')) for k in cols])
# 34.7 s ± 2.84 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
# bzu's method
%timeit func.arrays_zip(func.split(func.lit(','.join(cols)), ',').alias('k'), func.array(cols).alias('v'))
# 10.7 s ± 1.41 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thank you bzu & samkart, but for some reason I cannot get the new line working.
I have created a simple example below that doesn't work, in case you can see something obvious I am missing.
from pyspark.sql.functions import (
    array, arrays_zip, coalesce, col, explode, lit, lower, split, struct, substring,
)
from pyspark.sql.types import StringType

def process_data():
    try:
        logger.info("\ntest 1")
        df_audit = spark.createDataFrame(
            [("1", "foo", "abc", "xyz"), ("2", "bar", "def", "zab")],
            ["id", "label", "colx", "coly"],
        )
        logger.info("\ntest 2")
        key_cols = ["id", "label"]
        cols = [c for c in df_audit.columns if c not in key_cols]
        logger.info("\ntest 3")
        # exploded = explode(array([struct(lit(c).alias("key"), col(c).alias("val")) for c in cols])).alias("exploded")
        exploded = explode(arrays_zip(split(lit(','.join(cols)), ',').alias('key'), array(cols).alias('val'))).alias('exploded')
        logger.info("\ntest 4")
        df_audit = df_audit.select(key_cols + [exploded]).select(key_cols + ["exploded.key", "exploded.val"])
        df_audit.show()
    except Exception as e:
        logger.error("Error in process_audit_data: {}".format(e))
        return False
    return True
When I call the process_data function, I get the following logged:
test 1
test 2
test 3
test 4
Error in process_audit_data: No such struct field key in 0, 1.
Note: it does work successfully with the commented-out exploded line.
Many thanks
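Not a definitive fix, but the error message ('No such struct field key in 0, 1') suggests that in this Spark version arrays_zip names the zipped struct fields '0' and '1' rather than honouring the aliases. A hedged workaround sketch, keeping the same exploded expression and simply selecting the fields by those positional names:

df_audit = (df_audit
    .select(key_cols + [exploded])
    .select(key_cols + [col("exploded").getField("0").alias("key"),
                        col("exploded").getField("1").alias("val")]))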
I have been learning PySpark recently and wanted to apply it to one of my problems. Basically, I want to perform random trials on each record in a dataframe. My dataframe is structured as below.
order_id,order_date,distribution,quantity
O1,D1,3 4 4 5 6 7 8 ... ,10
O2,D2,1 6 9 10 12 16 18 ..., 20
O3,D3,7 12 15 16 18 20 ... ,50
Here the distribution column holds 100 percentile points, with each value space separated.
I want to loop through each of these rows in the dataframe, randomly select a point in the distribution, add that many days to order_date, and create a new column arrival_date.
At the end I want to get the avg(quantity) by arrival_date, so my final dataframe should look like:
arrival_date,qty
A1,5
A2,10
What I have achieved so far is below:
import datetime
import random

df = spark.read.option("header", True).csv("/tmp/test.csv")

def randSample(row):
    order_id = row.order_id
    quantity = int(row.quantity)
    data = []
    for i in range(1, 20):
        # Pick a random percentile point from the space-separated distribution.
        n = random.randint(0, 99)
        randnum = int(float(row.distribution.split(" ")[n]))
        arrival_date = datetime.datetime.strptime(row.order_date.split(" ")[0], "%Y-%m-%d") + datetime.timedelta(days=randnum)
        data.append((arrival_date, quantity))
    return data

finalRDD = df.rdd.map(randSample)
The calculations look correct; however, the finalRDD is structured as a list of lists, as below:
[
[(),(),(),()]
,[(),(),(),()]
,[(),(),(),()]
,[(),(),(),()]
]
Each list inside the main list is a single record, and each tuple inside a nested list is one trial of that record.
Basically, I want the final output as flattened records so that I can compute the average:
[
(),
(),
(),
]
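A minimal sketch of one way to get there, assuming the randSample function above: use flatMap instead of map, so each record's list of trials is flattened into individual rows, then convert back to a dataframe and average quantity per arrival date.

import pyspark.sql.functions as F

# flatMap (instead of map) flattens each record's list of trials into
# individual (arrival_date, quantity) tuples.
finalRDD = df.rdd.flatMap(randSample)

# Back to a dataframe, then average the quantity per arrival date.
result = (finalRDD
    .toDF(["arrival_date", "quantity"])
    .groupBy("arrival_date")
    .agg(F.avg("quantity").alias("qty")))
result.show()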
I have a PySpark dataframe with the below values:
[Row(id='ABCD123', score='28.095238095238095'), Row(id='EDFG456', score='36.2962962962963'), Row(id='HIJK789', score='37.56218905472637'), Row(id='LMNO1011', score='36.82352941176471')]
I want only the rows from the DF whose score falls between the input score value and the input score value + 1. Say the input score value is 36; then I want the output DF with only two ids - EDFG456 & LMNO1011 - as their scores fall between 36 and 37. I achieved this as follows:
from pyspark.sql.functions import substring

input_score_value = 36
input_df = my_df.withColumn("score_num", substring(my_df.score, 1, 2))
output_matched = input_df.filter(input_df.score_num == input_score_value)
print(output_matched.take(5))
The above code gives the output below, but it takes too long to process 2 million rows. I was wondering if there is a better way to do this to reduce the response time.
[Row(id='EDFG456', score='36.2962962962963'), Row(id='LMNO1011',score='36.82352941176471')]
You can use the function floor.
from pyspark.sql.functions import floor

# floor() on the score itself keeps rows in [input_score_value, input_score_value + 1).
output_matched = my_df.filter(floor(my_df.score.cast("double")) == input_score_value)
print(output_matched.take(5))
It should be much faster compared to substring. Let me know.
I have two columns that represent 'TeamName' and 'MatchResult', for example:
ManCity L
Liverpool D
Arsenal W
I'm trying to create a third column, 'Points', based on the match results of different football teams: 3 points for a Win, 1 for a Draw, 0 for a Loss.
I've tried .withColumn with when and if, but can't get the syntax right.
Thanks a lot in advance for your time. The expected output:
ManCity L 0
Liverpool D 1
Arsenal W 3
You can use:
from pyspark.sql.functions import when, col
df = df.withColumn("points", when(col("MatchResult") == "W", 3).when(col("MatchResult") == "D", 1).otherwise(0))
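For reference, a self-contained sketch (column names assumed from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("ManCity", "L"), ("Liverpool", "D"), ("Arsenal", "W")],
    ["TeamName", "MatchResult"],
)
df = df.withColumn(
    "points",
    when(col("MatchResult") == "W", 3)
    .when(col("MatchResult") == "D", 1)
    .otherwise(0),
)
df.show()
# ManCity -> 0, Liverpool -> 1, Arsenal -> 3, as in the expected output above.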
So, presume a matrix like so:
20 2
20 2
30 2
30 1
40 1
40 1
I want to count the number of times 1 occurs for each unique value of column 1. I could do this the long way with [sum(x(1:2,2)==1)] for each value, but I think this would be the perfect use for the UNIQUE function. How could I do it so that I get an output like this:
20 0
30 1
40 2
Sorry if the solution seems obvious, my grasp of loops is very poor.
Indeed unique is a good option:
u = unique(x(:,1))
res = arrayfun(@(y) length(x(x(:,1)==y & x(:,2)==1)), u)
Taking apart that last line:
arrayfun(fun, array) applies fun to each element in the array and puts the results in a new array, which it returns.
Here fun is the anonymous function @(y) length(x(x(:,1)==y & x(:,2)==1)), which finds the length of the portion of x where the condition x(:,1)==y & x(:,2)==1 holds (called logical indexing). So for each of the unique elements, it counts the rows of x where the first column is that unique element and the second column is one.
Try this (as specified in this answer):
>> [c,~,d] = unique(a(a(:,2)==1))
c =
    30
    40
d =
     1
     2
     2
>> counts = accumarray(d(:), 1, [], @sum)
counts =
     1
     2
>> res = [c, counts]
Consider that you have an array of various integers in 'array'.
The tabulate function will sort the unique values and count the occurrences:
table = tabulate(array)
Look for your counts in column 2 of table.