How to add a max date in Data Frame? - pyspark

I need to add a column that holds the maximum of a date column in a PySpark dataframe.
I am getting this error: "TypeError: 'Column' object is not callable"
Part3DF = Part3DF.withColumn('latest_installation_time1', to_date(unix_timestamp(col('transaction_time'),'MM-dd-yyyy').cast("timestamp")))
Part3DF = Part3DF.withColumn('latest_installation_time', (Part3DF.latest_installation_time1).max())

Try using max(Part3DF.latest_installation_time1) (the max function from pyspark.sql.functions) instead. Part3DF.latest_installation_time1 returns a Column object, which doesn't have a max method.
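A minimal sketch of how the two steps could look together; note that an aggregate like F.max cannot be used bare inside withColumn, it needs either a window (an empty partitionBy() spans all rows) or a separate agg plus a join:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# parse the transaction time into a date column
Part3DF = Part3DF.withColumn(
    'latest_installation_time1',
    F.to_date(F.unix_timestamp(F.col('transaction_time'), 'MM-dd-yyyy').cast('timestamp')))

# take the max over the whole dataframe; Spark will warn that the empty
# window moves all rows into a single partition
Part3DF = Part3DF.withColumn(
    'latest_installation_time',
    F.max('latest_installation_time1').over(Window.partitionBy()))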

Related

Pyspark : How to take Minimum in the timestamp column?

In PySpark, I tried to do this:
df = df.select(F.col("id"),
               F.col("mp_code"),
               F.col("mp_def"),
               F.col("mp_desc"),
               F.col("mp_code_desc"),
               F.col("zdmtrt06_zstation").alias("station"),
               F.to_timestamp(F.col("date_time"), "yyyyMMddHHmmss").alias("date_time_utc"))
df = df.groupBy("id", "mp_code", "mp_def", "mp_desc", "mp_code_desc", "station").min(F.col("date_time_utc"))
But I get this error:
raise TypeError("Column is not iterable")
TypeError: Column is not iterable
Here is an extract of the PySpark documentation:
GroupedData.min(*cols)
Computes the min value for each numeric column for each group.
New in version 1.3.0.
Parameters: cols – str
In other words, the min function does not support column arguments. It only works with column names (strings) like this:
df.groupBy("x").min("date_time_utc")
# you can also specify several column names
df.groupBy("x").min("y", "z")
Note that if you want to use a column object, you have to use agg:
df.groupBy("x").agg(F.min(F.col("date_time_utc")))

Match dates using two date columns as range

I am trying to create a column within Databricks using PySpark. I need to check whether a date column falls between two other date columns: 1 if it does, 0 if it doesn't. I want to call this column ground truth, since it tells me whether the reading date falls between the two event date columns. This is what I have so far:
df = (df
.withColumn("Ground_truth_IE", when(col("ReadingDateTime").between(col("EventStartDateTime") & col("EventEndDateTime")), 1).otherwiste(0)
)
)
But I continue to get an error:
TypeError: between() missing 1 required positional argument: 'upperBound'
The between() operator in PySpark takes two separate arguments, between(lowerBound, upperBound), rather than a single expression joined with & (note there is also a typo in your code: otherwiste should be otherwise):
df = df.withColumn("Ground_truth_IE", when(col("ReadingDateTime")\
.between(col("EventStartDateTime"),col("EventEndDateTime")), 1).otherwise(0))
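A minimal self-contained version of that fix, with the imports it relies on; note that between() is inclusive of both bounds:

from pyspark.sql.functions import col, when

df = df.withColumn(
    "Ground_truth_IE",
    when(col("ReadingDateTime").between(col("EventStartDateTime"), col("EventEndDateTime")), 1)
    .otherwise(0))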

How to apply nltk.pos_tag on pyspark dataframe

I'm trying to apply POS tagging to a tokenized column called "removed" in a PySpark dataframe.
I'm trying with
nltk.pos_tag(df_removed.select("removed"))
But all I get is:
ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.
How can I make it work?
It seems the answer is in the error message: pos_tag expects a plain Python list of tokens, but you are passing it a Column. You have to apply pos_tag to each row of your column, which in the DataFrame API means wrapping it in a UDF and using withColumn (pos_tag returns (word, tag) pairs, so the UDF needs a matching return type).
For example you could start by writing:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

pos_tag_schema = ArrayType(StructType([StructField("word", StringType()), StructField("tag", StringType())]))
pos_tag_udf = udf(lambda tokens: nltk.pos_tag(tokens), pos_tag_schema)
my_new_df = df_removed.withColumn("removed", pos_tag_udf(df_removed.removed))
You can also go through the underlying RDD (row[0] pulls the token list out of each Row, and the result is wrapped in a one-element tuple so that toDF() sees a single array column):
my_new_df = df_removed.select("removed").rdd.map(lambda row: (nltk.pos_tag(row[0]),)).toDF(["removed"])
Here you have the documentation.
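If you later need the tags as their own rows, the word/tag structs from the UDF sketch above can be exploded out; a short sketch (my_new_df and the field names word and tag come from that example):

from pyspark.sql.functions import explode, col

tags_df = (my_new_df
           .select(explode("removed").alias("token"))
           .select(col("token.word"), col("token.tag")))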

Getting max value out of a dataframe column of timestamp in scala/spark

I am working with a Spark dataframe that contains all the timestamp values in the column 'IMG_CREATED_DT'. I have used collectAsList() and toString() to get the values as a List and convert them into a String, but I am not sure how to fetch the max value out of it. Please guide me on this.
val query_new = s"""(select IMG_CREATED_DT from
${conf.get(UNCAppConstants.DB2_SCHEMA)}.$table)"""
println(query_new)
val db2_op=ConnectionUtilities_v.createDataFrame(src_props,srcConfig.url,query_new)
val t3 = db2_op.select("IMG_CREATED_DT").collectAsList().toString
How do I get the max value out of t3?
You can calculate the max value from the dataframe itself, without collecting it to the driver. Try the following sample:
import org.apache.spark.sql.functions.max

// aggregate on the dataframe, then pull the single value out of the first (and only) row
val t3 = db2_op.agg(max("IMG_CREATED_DT").as("maxVal")).take(1)(0).get(0)

how to get the row corresponding to the minimum value of some column in spark scala dataframe

I have the following code, which creates df3. I want to get the minimum value of distance_n and also the entire row containing that minimum value.
// it gives just the min value, but I want the entire row containing that min value
To get the entire row, I registered df3 as a table so I can query it with spark.sql.
If I do this:
spark.sql("select latitude,longitude,speed,min(distance_n) from table1").show()
// it throws an error
and if I do this:
spark.sql("select latitude,longitude,speed,min(distance_nd) from table180").show()
// replacing distance_n with distance_nd, it still throws the error
How do I resolve this so I get the entire row corresponding to the min value?
Before using a custom UDF in Spark SQL, you have to register it in Spark's SQL context.
e.g.:
spark.sqlContext.udf.register("strLen", (s: String) => s.length())
After the UDF is registered, you can access it in your Spark SQL like this:
spark.sql("select strLen(some_col) from some_table")
Reference: https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html