Pyspark : How to take Minimum in the timestamp column? - pyspark

In pyspark , i tried to do this
df = df.select(F.col("id"),
F.col("mp_code"),
F.col("mp_def"),
F.col("mp_desc"),
F.col("mp_code_desc"),
F.col("zdmtrt06_zstation").alias("station"),
F.to_timestamp(F.col("date_time"), "yyyyMMddHHmmss").alias("date_time_utc"))
df = df.groupBy("id", "mp_code", "mp_def", "mp_desc", "mp_code_desc", "station").min(F.col("date_time_utc"))
But, i have an issue
raise TypeError("Column is not iterable")
TypeError: Column is not iterable

Here is an extract of the pyspark documentation
GroupedData.min(*cols)[source]
Computes the min value for each numeric column for each group.
New in version 1.3.0.
Parameters: cols : str
In other words, the min function does not support column arguments. It only works with column names (strings) like this:
df.groupBy("x").min("date_time_utc")
# you can also specify several column names
df.groupBy("x").min("y", "z")
Note that if you want to use a column object, you have to use agg:
df.groupBy("x").agg(F.min(F.col("date_time_utc")))

Related

How to append 'explode'd columns to a dataframe keeping all existing columns?

I'm trying to add exploded columns to a dataframe:
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Convenience function for turning JSON strings into DataFrames.
def jsonToDataFrame(json, schema=None):
# SparkSessions are available with Spark 2.0+
reader = spark.read
if schema:
reader.schema(schema)
return reader.json(sc.parallelize([json]))
schema = StructType().add("a", MapType(StringType(), IntegerType()))
events = jsonToDataFrame("""
{
"a": {
"b": 1,
"c": 2
}
}
""", schema)
display(
events.withColumn("a", explode("a").alias("x", "y"))
)
However, I'm hitting the following error:
AnalysisException: The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases but got a
Any ideas?
In the end, I used the following:
display(
events.select(explode("a").alias("x", "y"), *[c for c in events.columns])
)
This approach uses select to specify the columns to return.
The first argument explodes the data:
explode("a").alias("x", "y")
The second argument specifies all existing columns should be included in the select:\
*[c for c in events.columns]
Note that I'm prefixing the list with * - this sends each column name as a separate parameter.
Simpler Method
The API docs specify:
Parameters
colsstr, Column, or list
column names (string) or expressions (Column). If one of the column names is ‘*’, that column is expanded to include all columns in the current DataFrame.
We can simplify the first approach by passing in "*" to select all the columns:
display(
events.select("*", explode("a").alias("x", "y"))
)

Pyspark dynamic column name

I have a dataframe which contains months and will change quite frequently. I am saving this dataframe values as list e.g. months = ['202111', '202112', '202201']. Using a for loop to to iterate through all list elements and trying to provide dynamic column values with following code:
for i in months:
df = (
adjustment_1_prepared_df.select("product", "mnth", "col1", "col2")
.groupBy("product")
.agg(
f.min(f.when(condition, f.col("col1")).otherwise(9999999)).alias(
concat("col3_"), f.lit(i.col)
)
)
)
So basically in alias I am trying to give column name as a combination of constant (minInv_) and a variable (e.g. 202111) but I am getting error. How can I give a column name as combination of fixed string and a variable.
Thanks in advance!
.alias("col3_"+str(i.col))

Adding empty columns to dataframe with empty values (by type) pyspark

I have the following list:
columns = [('url','string'),('count','bigint'),('isindex','boolean')]
I want to add this columns to my df with empty values:
for column in columns:
df = df.withColumn(column[0], f.lit(?).cast(?))
I am not sure what I need to put in the lit function and in the cast in order to have the suitable empty value for each type
Thank you!

Match dates using two date columns as range

I am trying to create a column within databricks using pyspark. I need to check if date column is found between two other date columns and if it is then 1 if it is not then 0. I am wanting to call this ground truth, since this will tell me if on date it's found in between the two date columns. This is what I have so far:
df = (df
.withColumn("Ground_truth_IE", when(col("ReadingDateTime").between(col("EventStartDateTime") & col("EventEndDateTime")), 1).otherwiste(0)
)
)
But I continue to get an error:
TypeError: between() missing 1 required positional argument: 'upperBound'
The between() operator in pyspark should be used like: between(lowerBound, upperBound)
df = df.withColumn("Ground_truth_IE", when(col("ReadingDateTime")\
.between(col("EventStartDateTime"),col("EventEndDateTime")), 1).otherwise(0))

Iterate across columns in spark dataframe and calculate min max value

I want to iterate across the columns of dataframe in my Spark program and calculate min and max value.
I'm new to Spark and scala and not able to iterate over the columns once I fetch it in a dataframe.
I have tried running the below code but it needs column number to be passed to it, question is how do I fetch it from dataframe and pass it dynamically and store the result in a collection.
val parquetRDD = spark.read.parquet("filename.parquet")
parquetRDD.collect.foreach ({ i => parquetRDD_subset.agg(max(parquetRDD(parquetRDD.columns(2))), min(parquetRDD(parquetRDD.columns(2)))).show()})
Appreciate any help on this.
You should not be iterating on rows or records. You should be using aggregation function
import org.apache.spark.sql.functions._
val df = spark.read.parquet("filename.parquet")
val aggCol = col(df.columns(2))
df.agg(min(aggCol), max(aggCol)).show()
First when you do spark.read.parquet you are reading a dataframe.
Next we define the column we want to work on using the col function. The col function translate a column name to a column. You could instead use df("name") where name is the name of the column.
The agg function takes aggregation columns so min and max are aggregation functions which take a column and return a column with an aggregated value.
Update
According to the comments, the goal is to have min and max for all columns. You can therefore do this:
val minColumns = df.columns.map(name => min(col(name)))
val maxColumns = df.columns.map(name => max(col(name)))
val allMinMax = minColumns ++ maxColumns
df.agg(allMinMax.head, allMinMax.tail: _*).show()
You can also simply do:
df.describe().show()
which gives you statistics on all columns including min, max, avg, count and stddev