In my PySpark data frame, I create a new column which casts an existing column to a date:
df = df.filter(....)\
    .withColumn('Date', col('Time').cast('date'))
My question is: how can I run strftime so that I just get the year and month of that date?
I tried doing
.withColumn('Month',col('Time').cast('date').strftime("%Y-%m"))\
I get an error saying: TypeError: 'Column' object is not callable
strftime is not a PySpark function; try the date_format function instead.
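For example, a minimal sketch assuming the Time column from the question:

from pyspark.sql.functions import col, date_format

# date_format takes a date/timestamp column and a Java-style pattern,
# so 'yyyy-MM' returns strings like '2019-07'
df = df.withColumn('Month', date_format(col('Time').cast('date'), 'yyyy-MM'))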
I have a date parameter and want to use it to create a new column in my dataframe. Is there any way I can pass this parameter to my DF as a new column?
val date = "2020/01/01"
df.withColumn("DateTemp", date)
You cannot directly assign a constant value like this:
val date = "2020/01/01"
df.withColumn("DateTemp", date)
You can achieve this with lit:
df.withColumn("DateTemp", lit("2020/01/01"))
You can further cast the lit value, e.g. lit("2020/01/01").cast(DateType), if needed.
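Note that the parameter itself can go straight into lit. A minimal PySpark sketch (the Scala version is analogous); since the value uses slashes, a plain cast to date may come back null, so parsing it with an explicit format via to_date is safer:

from pyspark.sql.functions import lit, to_date

date = "2020/01/01"
# wrap the variable in lit, then parse it with the matching pattern
df = df.withColumn("DateTemp", to_date(lit(date), "yyyy/MM/dd"))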
I want to create a timestamp column from two columns containing the month and year respectively, so that I can build a line chart.
The df looks like this:
I know I can create a string concat and then convert it to a datetime column:
from pyspark.sql.functions import concat, lit, col
from pyspark.sql.types import TimestampType

df.select('*',
    concat(lit('01'), df['month'], df['year']).alias('date')
).withColumn('date', col('date').cast(TimestampType()))
But I wanted a cleaner approach using built-in PySpark functionality that can also help me create other date parts, like week number, quarter, etc. Any suggestions?
You will have to concatenate the string once and make the timestamp-type column; then you can easily extract week, quarter, etc.
You can use this function (and edit it to create whatever other columns you need as well):
import pyspark.sql.functions as F

def spark_date_parsing(df, date_column, date_format):
    """
    Parses the date column given the date format in a spark dataframe
    NOTE: This is a Pyspark implementation

    Parameters
    ----------
    :param df: Spark dataframe having a date column
    :param date_column: Name of the date column
    :param date_format: Simple Date Format (Java-style) of the dates in the date column

    Returns
    -------
    :return: A spark dataframe with a parsed date column
    """
    df = df.withColumn(date_column, F.to_timestamp(F.col(date_column), date_format))

    # Spark returns null if the parsing fails, so first check the count of null values
    # If parse_fail_count == 0, return the parsed dataframe, else raise an error
    parse_fail_count = df.select(
        F.count(F.when(F.col(date_column).isNull(), date_column))
    ).collect()[0][0]

    if parse_fail_count == 0:
        return df
    else:
        raise ValueError(
            f"Incorrect date format '{date_format}' for date column '{date_column}'"
        )
Usage (with whatever your resultant date format is):
df = spark_date_parsing(df, "date", "dd/MM/yyyy")
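From the parsed column you can then derive the other date parts mentioned in the question, for example (a sketch assuming the parsed column is named "date"):

import pyspark.sql.functions as F

# week number, quarter, month and year are all built-in functions
df = (df.withColumn("week", F.weekofyear("date"))
        .withColumn("quarter", F.quarter("date"))
        .withColumn("month_num", F.month("date"))
        .withColumn("year_num", F.year("date")))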
When trying to add days to a date in another column, the value returns as a number, not a date.
I have tried setting the column format to date for both columns B and C. I also tried using the DATEVALUE() function, but I don't think I used it properly.
=ARRAYFORMULA(IF(ROW(B:B)=1,"Second Notification",IF(LEN(B:B), B:B+1,)))
I want the value in column C to return as a date.
Use this with the TEXT formula:
={"Second Notification";
ARRAYFORMULA(IF(LEN(B2:B), TEXT(B2:B+1, "MM/dd/yyyy hh:mm:ss"), ))}
Am I doing this right?
I have a timestamp column that I convert into a first-of-the-month date.
df = df.withColumn("monthlyTransactionDate", f.trunc(df[transactionDate], 'mon').alias('month'))
I then run this code as I want to generate all possible months between the min and max dates:
import pyspark.sql.functions as f
minDate, maxDate = df.select(f.min("MonthlyTransactionDate"), f.max("MonthlyTransactionDate")).first()
df.withColumn("monthsDiff", f.months_between(maxDate, minDate))\
.withColumn("repeat", f.expr("split(repeat(',', monthsDiff), ',')"))\
.select("*", f.posexplode("repeat").alias("date", "val"))\
.withColumn("date", f.expr("add_months(minDate, date)"))\
.select('date')\
.show(n=50)
But I get the following error at the start of the last section:
TypeError: Invalid argument, not a string or column: 2016-12-01 of type <type 'datetime.date'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
Here,
minDate, maxDate = df.select(f.min("MonthlyTransactionDate"), f.max("MonthlyTransactionDate")).first()
returns the min and max date values as date objects. To use these exact values for all rows, use lit() from functions:
df.withColumn("monthsDiff", f.months_between(f.lit(maxDate), f.lit(minDate)))
I have a dataframe with two columns (C, D) that are defined as string column type, but the data in the columns are actually dates. For example, column C has the date as "01-APR-2015" and column D as "20150401". I want to change these to date column type, but I didn't find a good way of doing that. I looked at the Stack Overflow question "I need to convert the string column type to Date column type in Spark SQL's DataFrame"; my date format can be "01-APR-2015". I also looked at this post, but it didn't have info related to dates.
Spark >= 2.2
You can use to_date:
import org.apache.spark.sql.functions.{to_date, to_timestamp}
df.select(to_date($"ts", "dd-MMM-yyyy").alias("date"))
or to_timestamp:
df.select(to_date($"ts", "dd-MMM-yyyy").alias("timestamp"))
Neither needs an intermediate unix_timestamp call.
Spark < 2.2
Since Spark 1.5 you can use the unix_timestamp function to parse the string to a long, cast it to a timestamp, and truncate it with to_date:
import org.apache.spark.sql.functions.{unix_timestamp, to_date}
val df = Seq((1L, "01-APR-2015")).toDF("id", "ts")
df.select(to_date(unix_timestamp(
$"ts", "dd-MMM-yyyy"
).cast("timestamp")).alias("timestamp"))
Note:
Depending on your Spark version, this may require some adjustments due to SPARK-11724:
Casting from integer types to timestamp treats the source int as being in millis. Casting from timestamp to integer types creates the result in seconds.
If you use an unpatched version, the unix_timestamp output requires multiplication by 1000.
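If you are working from PySpark instead, the same idea in a short sketch, assuming the columns are named C and D as in the question (on Spark 3 the upper-case month abbreviation may require the legacy time parser policy):

from pyspark.sql.functions import to_date

# each column just needs its own pattern
df = (df.withColumn("C", to_date("C", "dd-MMM-yyyy"))
        .withColumn("D", to_date("D", "yyyyMMdd")))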