I have used agg to get the average value of a column in my data frame, like this:
(df.groupBy('day', 'city')
   .agg(count("*"),
        avg(df.price).alias("avgPrice"))
)
From Calculate percentile on pyspark dataframe columns, I see that I can use df.selectExpr('percentile(MOU_G_EDUCATION_ADULT, 0.95)') to get the 95th percentile of a column.
So how can I add that inside the agg() function?
You can use the expr function to add it inside agg:
from pyspark.sql.functions import avg, count, expr
(df.groupBy('city')
   .agg(count("*"),
        avg(df.price).alias("avgPrice"),
        expr("percentile(price, 0.95)").alias("percentile"))
)
However, as the linked answer suggests, if your dataset is large and you do not mind some approximation, consider using percentile_approx instead.
from pyspark.sql.functions import avg, count, percentile_approx
(df.groupBy('city')
   .agg(count("*"),
        avg(df.price).alias("avgPrice"),
        percentile_approx('price', 0.95).alias('percentile'))
)
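If you need tighter error bounds, percentile_approx also takes an optional accuracy argument (this assumes Spark 3.1+, where the Python wrapper exposes it); larger values trade memory for precision:
from pyspark.sql.functions import percentile_approx
# accuracy is a positive integer; higher values give better precision at the
# cost of more memory (10000 is the default)
(df.groupBy('city')
   .agg(percentile_approx('price', 0.95, accuracy=100000).alias('p95'))
)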
I have data like this
I want output like this
How do I achieve this?
One way of doing it is: pivot, create an array, and sum the values within the array.
from pyspark.sql.functions import *

s = df.groupby('id').pivot('year').agg(sum('amount'))  # pivot
(s.withColumn('x', array(*[x for x in s.columns if x != 'id']))  # create array
  .withColumn('x', expr("reduce(x, cast(0 as bigint), (c, i) -> c + i)"))  # sum
).show()
Or use Spark's built-in aggregate higher-order function:
s = df.groupby('id').pivot('year').agg(sum('amount'))  # pivot
(s.withColumn('x', array(*[x for x in s.columns if x != 'id']))  # create array
  .withColumn('x', expr("aggregate(x, cast(0 as bigint), (c, i) -> c + i)"))  # sum
).show()
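Since the sample input and expected output tables are not shown above, here is a small self-contained sketch with made-up data (the id, year, and amount column names are inferred from the code) illustrating the same pivot / array / aggregate pattern:
from pyspark.sql import SparkSession
from pyspark.sql.functions import array, col, expr
from pyspark.sql.functions import sum as sum_

spark = SparkSession.builder.getOrCreate()
# made-up rows: one row per (id, year, amount)
df = spark.createDataFrame(
    [(1, 2019, 100), (1, 2020, 200), (2, 2019, 300), (2, 2020, 400)],
    ['id', 'year', 'amount'])

s = df.groupby('id').pivot('year').agg(sum_('amount'))  # one column per year
(s.withColumn('total', array(*[col(c) for c in s.columns if c != 'id']))
  .withColumn('total', expr("aggregate(total, cast(0 as bigint), (c, i) -> c + i)"))
).show()
# id 1 -> total 300, id 2 -> total 700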
I have a data frame with 900 columns and I need the sum of each column in PySpark, so the result will be 900 values in a list. The data has around 280 million rows, all binary. How can I do this?
Assuming you already have the data in a Spark DataFrame, you can use the sum SQL function, together with DataFrame.agg.
For example:
sdf = spark.createDataFrame([[1, 3], [2, 4]], schema=['a','b'])
from pyspark.sql import functions as F
sdf.agg(F.sum(sdf.a), F.sum(sdf.b)).collect()
# Out: [Row(sum(a)=3, sum(b)=7)]
Since in your case you have quite a few columns, you can use a list comprehension to avoid naming columns explicitly.
sums = sdf.agg(*[F.sum(sdf[c_name]) for c_name in sdf.columns]).collect()
Notice how you need to unpack the arguments from the list using the * operator.
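Since the goal is to end up with the 900 sums in a plain list, a small follow-up sketch flattening the single collected Row (reusing the sums and sdf names from the snippet above):
# sums is a list containing a single Row with every column sum;
# turn it into a flat Python list (900 values for 900 columns)
sums_list = list(sums[0])
# optionally keep the column names alongside the values
sums_by_column = dict(zip(sdf.columns, sums_list))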
I have n values (the column length) in a Spark column. I want to create a Spark dataframe of k columns (where k is the number of samples) and m rows (where m is the sample size). I tried using withColumn, but it does not work, and joining on a generated unique id would be very inefficient for me.
e.g. the Spark column has the following values:
102
320
11
101
2455
124
I want to create 2 samples of fraction 0.5 as columns in a data frame.
So the sampled data frame would be something like:
sample1,sample2
320,101
124,2455
2455,11
Let df have a column UNIQUE_ID_D; I need k samples from this column. Here is sample code for k = 2:
var df1 = df.select("UNIQUE_ID_D").sample(false, 0.1).withColumnRenamed("UNIQUE_ID_D", "ID_1")
var df2 = df.select("UNIQUE_ID_D").sample(false, 0.1).withColumnRenamed("UNIQUE_ID_D", "ID_2")
df1.withColumn("NEW_UNIQUE_ID", df2.col("ID_2")).show
This won't work since withColumn cannot access df2's columns.
The only way I see is to join df1 and df2 after adding a sequence column (a join key) to both data frames.
That is very inefficient for my use case: if I want to take 100 samples, I would need to join 100 times in a loop for a single column, and I need to perform this operation for every column in the original df.
How could I achieve this?
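For reference, a minimal PySpark sketch of the sequence-column join described above, i.e. the approach the question considers too inefficient (the global, unpartitioned window is exactly what makes it expensive); the column name UNIQUE_ID_D and the 0.5 fraction come from the question, everything else is assumed:
from pyspark.sql import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

# a global window with no partitioning: every row is shuffled to one partition,
# which is the inefficiency the question complains about
w = Window.orderBy(monotonically_increasing_id())
sample1 = (df.select('UNIQUE_ID_D').sample(False, 0.5)
             .withColumnRenamed('UNIQUE_ID_D', 'sample1')
             .withColumn('rn', row_number().over(w)))
sample2 = (df.select('UNIQUE_ID_D').sample(False, 0.5)
             .withColumnRenamed('UNIQUE_ID_D', 'sample2')
             .withColumn('rn', row_number().over(w)))
sample1.join(sample2, 'rn').drop('rn').show()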
I want to iterate across the columns of a dataframe in my Spark program and calculate min and max values.
I'm new to Spark and Scala and am not able to iterate over the columns once I fetch them into a dataframe.
I have tried running the code below, but it needs a column number to be passed to it; the question is how to fetch the columns from the dataframe dynamically, pass them in, and store the results in a collection.
val parquetRDD = spark.read.parquet("filename.parquet")
parquetRDD.collect.foreach ({ i => parquetRDD_subset.agg(max(parquetRDD(parquetRDD.columns(2))), min(parquetRDD(parquetRDD.columns(2)))).show()})
Appreciate any help on this.
You should not be iterating over rows or records. You should be using an aggregation function:
import org.apache.spark.sql.functions._
val df = spark.read.parquet("filename.parquet")
val aggCol = col(df.columns(2))
df.agg(min(aggCol), max(aggCol)).show()
First, when you do spark.read.parquet you are reading the data into a dataframe.
Next, we define the column we want to work on using the col function, which translates a column name into a Column. You could instead use df("name"), where name is the name of the column.
The agg function takes aggregation columns; min and max are aggregation functions which take a column and return a column with the aggregated value.
Update
According to the comments, the goal is to have min and max for all columns. You can therefore do this:
val minColumns = df.columns.map(name => min(col(name)))
val maxColumns = df.columns.map(name => max(col(name)))
val allMinMax = minColumns ++ maxColumns
df.agg(allMinMax.head, allMinMax.tail: _*).show()
You can also simply do:
df.describe().show()
which gives you statistics on all columns, including min, max, mean, count, and stddev.
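For readers following the PySpark examples elsewhere on this page, a rough Python equivalent of the all-columns min/max aggregation above (a sketch, assuming df is a PySpark DataFrame):
from pyspark.sql.functions import col
from pyspark.sql.functions import max as max_, min as min_

# one min and one max aggregation per column, computed in a single pass
min_cols = [min_(col(name)) for name in df.columns]
max_cols = [max_(col(name)) for name in df.columns]
df.agg(*(min_cols + max_cols)).show()
df.describe().show()  # the describe() shortcut is available in PySpark as well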
I have a spark dataframe containing 1 million rows and 560 columns. I need to find the count of unique items in each column of the dataframe.
I have written the following code to achieve this but it is getting stuck and taking too much time to execute:
count_unique_items = []
for j in range(len(cat_col)):
    var = cat_col[j]
    count_unique_items.append(
        data.select(var).distinct().rdd.map(lambda r: r[0]).count())
cat_col contains the column names of all the categorical variables
Is there any way to optimize this?
Try using approxCountDistinct or countDistinct:
from pyspark.sql.functions import approxCountDistinct, countDistinct
counts = df.agg(approxCountDistinct("col1"), approxCountDistinct("col2")).first()
but counting distinct elements is expensive.
You can do something like this, but as stated above, distinct-element counting is expensive. The * unpacks the generator so each aggregation is passed as a separate argument, and the return value will be 1 row by N columns. I frequently do a .toPandas() call to make it easier to manipulate later down the road.
from pyspark.sql.functions import col, approxCountDistinct
distvals = df.agg(*(approxCountDistinct(col(c), rsd=0.01).alias(c) for c in df.columns))
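A small follow-up showing the .toPandas() step mentioned above, reusing the distvals name from the snippet:
# distvals has one row with one column per original column;
# transposing the pandas frame gives one row per column, which is easier to scan
pdf = distvals.toPandas().T
pdf.columns = ['approx_distinct_count']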
You can get the frequent items of each column with
df.stat.freqItems([list of column names], [minimum frequency to report (default = 1%)])
This returns a dataframe with those values, but if you want a dataframe with just the distinct count of each column, use this:
from pyspark.sql.functions import countDistinct
df.select([countDistinct(cn).alias("c_{0}".format(cn)) for cn in df.columns]).show()
The counting part is taken from here: check number of unique values in each column of a matrix in spark