How to sum values of an entire column in pyspark - pyspark

I have a data frame with 900 columns and I need the sum of each column in pyspark, so the result will be 900 values in a list. How can I do this? The data has around 280 million rows, all binary.

Assuming you already have the data in a Spark DataFrame, you can use the sum SQL function, together with DataFrame.agg.
For example:
sdf = spark.createDataFrame([[1, 3], [2, 4]], schema=['a','b'])
from pyspark.sql import functions as F
sdf.agg(F.sum(sdf.a), F.sum(sdf.b)).collect()
# Out: [Row(sum(a)=3, sum(b)=7)]
Since in your case you have quite a few columns, you can use a list comprehension to avoid naming columns explicitly.
sums = sdf.agg(*[F.sum(sdf[c_name]) for c_name in sdf.columns]).collect()
Notice how you need to unpack the arguments from the list using the * operator.
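To see what the * does independently of Spark, here is a minimal plain-Python sketch. The function agg_like is a hypothetical stand-in for any variadic method such as DataFrame.agg, not a Spark API, and the strings stand in for column expressions.

```python
def agg_like(*exprs):
    """Hypothetical stand-in for a variadic method such as DataFrame.agg."""
    # Return the positional arguments exactly as they were received.
    return exprs


columns = ["a", "b", "c"]
# Plain strings stand in for column expressions such as F.sum(sdf[c_name]).
exprs = ["sum({})".format(c) for c in columns]

unpacked = agg_like(*exprs)     # three separate arguments, one per column
not_unpacked = agg_like(exprs)  # a single argument: the whole list

print(len(unpacked), len(not_unpacked))
```

With the *, the function receives one argument per column. Without it, the function receives the list itself as its only argument.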

Related

Pyspark DataFrame - Discretize the selected numerical column and then apply groupby and crosstab function

I have a dataframe with 100+ numerical columns. I want to discretize some of these columns and then apply the groupby and crosstab functions to the discretized columns.
Currently, I am using a loop to iterate over all selected numerical columns, but it is very time-consuming. Is there a better and cleaner solution? My code looks like this:
from pyspark.ml.feature import QuantileDiscretizer
from pyspark.sql.functions import mean, count
from pyspark.sql.types import StructType
df_num = spark.createDataFrame(data=[], schema=StructType([]))
for name in number_columns:
    # Bucket the column into 10 quantile bins
    steps = QuantileDiscretizer(numBuckets=10, inputCol=name, outputCol=name+'Bin')
    Selected_data = steps.fit(Selected_data).transform(Selected_data)
    tmp = Selected_data.groupBy(name+'Bin').agg(mean("ABC"), mean("XYZ"), count("ABC"), count("XYZ")).withColumnRenamed(name+'Bin', 'Category')
    temp = Selected_data.crosstab(name+'Bin', 'code').withColumnRenamed(name+'Bin_code', 'Category')
    temp = temp.join(tmp, 'Category', 'inner')
    df_num = df_num.unionByName(temp, allowMissingColumns=True)

PySpark - iterate rows of a Data Frame

I need to iterate over the rows of a pyspark.sql.dataframe.DataFrame.
I have done it in pandas in the past with the function iterrows() but I need to find something similar for pyspark without using pandas.
If I do for row in myDF: it iterates over the columns.
Thanks
You can use the select method to operate on your dataframe with a user-defined function, something like this:
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import StringType
columns = myDF.columns
my_udf = F.udf(lambda data: "do what ever you want here ", StringType())
myDF.select(*[my_udf(col(c)) for c in columns])
Then, inside the select, you can choose what you want to do with each column.

Appending multiple samples of a column into dataframe in spark

I have n values in a Spark column. I want to create a Spark dataframe of k columns (where k is the number of samples) and m rows (where m is the sample size). I tried using withColumn, but it does not work. Joining on a generated unique id would be very inefficient for me.
For example, the Spark column has the following values:
102
320
11
101
2455
124
I want to create 2 samples of fraction 0.5 as columns in data frame.
So sampled data frame will be something like
sample1,sample2
320,101
124,2455
2455,11
Suppose df has a column UNIQUE_ID_D and I need k samples from this column. Here is sample code for k = 2:
var df1 = df.select("UNIQUE_ID_D").sample(false, 0.1).withColumnRenamed("UNIQUE_ID_D", "ID_1")
var df2 = df.select("UNIQUE_ID_D").sample(false, 0.1).withColumnRenamed("UNIQUE_ID_D", "ID_2")
df1.withColumn("NEW_UNIQUE_ID", df2.col("ID_2")).show
This won't work, since withColumn cannot access a column of df2.
The only way is to join df1 and df2 by adding a sequence column (join column) to both dataframes.
That is very inefficient for my use case: if I want to take 100 samples, I need to join 100 times in a loop for a single column, and I need to perform this operation for all columns in the original df.
How could I achieve this?
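To make the desired output shape concrete, here is a minimal plain-Python sketch of k samples laid out as columns. It is not a Spark solution. The names values, k, fraction and m follow the question, the fixed seed is an assumption added only for repeatability, and the sample size is fixed at exactly m, whereas Spark's sample returns only approximately the requested fraction.

```python
import random

values = [102, 320, 11, 101, 2455, 124]
k = 2            # number of samples (columns)
fraction = 0.5   # sample fraction
m = int(len(values) * fraction)  # sample size (rows)

rng = random.Random(42)  # fixed seed, only so the sketch is repeatable
# Draw k independent samples of size m, each without replacement.
samples = [rng.sample(values, m) for _ in range(k)]
# Transpose: row i holds the i-th element of every sample.
rows = list(zip(*samples))

print(",".join("sample{}".format(i + 1) for i in range(k)))
for row in rows:
    print(",".join(str(v) for v in row))
```

Because each sample is drawn independently, the same value can appear in more than one column, as 2455 does in the example table above.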

Iterate across columns in spark dataframe and calculate min max value

I want to iterate across the columns of a dataframe in my Spark program and calculate the min and max values.
I'm new to Spark and Scala and am not able to iterate over the columns once I have fetched them into a dataframe.
I have tried running the code below, but it needs a column number to be passed to it. How do I fetch the column from the dataframe, pass it dynamically, and store the result in a collection?
val parquetRDD = spark.read.parquet("filename.parquet")
parquetRDD.collect.foreach ({ i => parquetRDD_subset.agg(max(parquetRDD(parquetRDD.columns(2))), min(parquetRDD(parquetRDD.columns(2)))).show()})
Appreciate any help on this.
You should not be iterating over rows or records. You should be using an aggregation function:
import org.apache.spark.sql.functions._
val df = spark.read.parquet("filename.parquet")
val aggCol = col(df.columns(2))
df.agg(min(aggCol), max(aggCol)).show()
First, when you call spark.read.parquet you get a dataframe.
Next, we define the column we want to work on using the col function, which translates a column name into a column. You could instead use df("name"), where name is the name of the column.
The agg function takes aggregation columns; min and max are aggregation functions that take a column and return a column with the aggregated value.
Update
According to the comments, the goal is to have min and max for all columns. You can therefore do this:
val minColumns = df.columns.map(name => min(col(name)))
val maxColumns = df.columns.map(name => max(col(name)))
val allMinMax = minColumns ++ maxColumns
df.agg(allMinMax.head, allMinMax.tail: _*).show()
You can also simply do:
df.describe().show()
which gives you statistics on all columns: count, mean, stddev, min and max.

pyspark: get unique items in each column of a dataframe

I have a spark dataframe containing 1 million rows and 560 columns. I need to find the count of unique items in each column of the dataframe.
I have written the following code to achieve this but it is getting stuck and taking too much time to execute:
count_unique_items = []
for j in range(len(cat_col)):
    var = cat_col[j]
    count_unique_items.append(data.select(var).distinct().rdd.map(lambda r: r[0]).count())
Here cat_col contains the column names of all the categorical variables.
Is there any way to optimize this?
Try using approx_count_distinct (called approxCountDistinct before Spark 2.1, a name that is now deprecated) or countDistinct:
from pyspark.sql.functions import approx_count_distinct, countDistinct
counts = df.agg(approx_count_distinct("col1"), approx_count_distinct("col2")).first()
Keep in mind that counting distinct elements is expensive.
You can do something like this, but as stated above, distinct element counting is expensive. The single * passes in each value as an argument, so the return value will be 1 row X N columns. I frequently do a .toPandas() call to make it easier to manipulate later down the road.
# approx_count_distinct is the current name of the deprecated approxCountDistinct
from pyspark.sql.functions import col, approx_count_distinct
distvals = df.agg(*(approx_count_distinct(col(c), rsd=0.01).alias(c) for c in df.columns))
You can get the frequent items of each column with
df.stat.freqItems([list of column names], [minimum frequency (default = 1%)])
This returns a dataframe with those values, but if you want a dataframe with just the distinct count of each column, use this:
from pyspark.sql.functions import countDistinct
df.select( [ countDistinct(cn).alias("c_{0}".format(cn)) for cn in df.columns ] ).show()
The counting part is taken from: check number of unique values in each column of a matrix in spark
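As a plain-Python analogy for what that single select computes, the sketch below gathers one distinct count per column in one pass over the rows, instead of one pass per column as in the loop from the question. The rows and column names are made-up sample data. In Spark the work is distributed and countDistinct does this for you.

```python
# Made-up sample data: each dict is one row.
rows = [
    {"color": "red", "size": "S"},
    {"color": "blue", "size": "S"},
    {"color": "red", "size": "M"},
]
columns = ["color", "size"]

# One set of seen values per column, filled in a single pass over the rows.
seen = {c: set() for c in columns}
for row in rows:
    for c in columns:
        seen[c].add(row[c])

# Same naming scheme as the alias in the select above: c_<column name>.
distinct_counts = {"c_{0}".format(c): len(seen[c]) for c in columns}
print(distinct_counts)
```

Exact counting has to remember every distinct value it has seen, which is why it becomes expensive on wide, high-cardinality data and why the approximate variant can be attractive.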