Convert characters to date format in R

New to R. Trying to convert a column "aza_start_year_month" with dates that are currently in character format (e.g. "2010-01") into date format, in a new column titled "azastartasdate".
Tried the following
bdata %>%
  mutate(azastartasdate = as.Date(aza_start_year_month, format = "%Y-%m")) %>%
  select(azastartasdate, aza_start_year_month)
but the "azastartasdate" column just returns with "NA" in each row.
Any help would be much appreciated!
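In case it helps future readers: as.Date() needs a complete date including a day, so a year-month string like "2010-01" parses to NA with format = "%Y-%m". A minimal sketch of the usual workaround, pasting a day onto the string before parsing (lubridate::ym() is an alternative if that package is available):

bdata %>%
  # append "-01" so the string becomes a full year-month-day date
  mutate(azastartasdate = as.Date(paste0(aza_start_year_month, "-01"),
                                  format = "%Y-%m-%d")) %>%
  select(azastartasdate, aza_start_year_month)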

Related

Pyspark convert string type date into dd-mm-yyyy format

Using pyspark 2.4.0.
I have the date column in the dataframe as follows:
I need to convert it into DD-MM-YYYY format. I have tried a few solutions, including the following code, but it returns null values:
df_students_2 = df_students.withColumn(
    'new_date',
    F.to_date(
        F.unix_timestamp('dt', '%B %d, %Y').cast('timestamp')))
Note that there are different types of date formats in the dt column. It would be easier if I could put the whole column into one format just for ease of converting, but since the dataframe is big it is not possible to go through each value and change it to one format. I have also tried the following code (including it here just for future readers), in which I tried to loop through the two types of date, but did not succeed:
def to_date_(col, formats=(datetime.strptime(col, "%B %d, %Y"),
                           datetime.strptime(col, "%d %B %Y"), "null")):
    return F.coalesce(*[F.to_date(col, f) for f in formats])
Any ideas?
Try this - implemented in Scala, but it can be done in PySpark with minimal changes.
// I've put the example formats, but just replace this list with expected formats in the dt column
val dt_formats= Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd","MM/dd/yy","dd-MM-yy","dd-MM-yyyy","yyyy/MM/dd","dd/MM/yyyy")
val newDF = df_students.withColumn("new_date", coalesce(dt_formats.map(fmt => to_date($"dt", fmt)):_*))
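A rough PySpark equivalent of the same coalesce-over-formats idea might look like this (the format list and dataframe name are taken from the Scala snippet above; adjust them to the formats that actually occur in dt):

from pyspark.sql import functions as F

# illustrative formats - replace with the ones expected in the dt column
dt_formats = ["dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd", "MM/dd/yy",
              "dd-MM-yy", "dd-MM-yyyy", "yyyy/MM/dd", "dd/MM/yyyy"]

# coalesce keeps the first format that parses; formats that fail yield null
new_df = df_students.withColumn(
    "new_date",
    F.coalesce(*[F.to_date(F.col("dt"), fmt) for fmt in dt_formats]))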
Try this, it should work...
from pyspark.sql.functions import to_date
df = spark.createDataFrame([("Mar 25, 1991",), ("May 1, 2020",)],['date_str'])
df.select(to_date(df.date_str, 'MMM d, yyyy').alias('dt')).collect()
[Row(dt=datetime.date(1991, 3, 25)), Row(dt=datetime.date(2020, 5, 1))]
See also: Datetime Patterns for Formatting and Parsing.

create a timestamp from month and year string columns in PySpark

I want to create a timestamp column to create a line chart from two columns containing month and year respectively.
The df looks like this:
I know I can create a string concat and then convert it to a datetime column:
df.select('*',
          concat(lit('01'), df['month'], df['year']).alias('date')) \
  .withColumn("date", col('date').cast(TimestampType()))
But I wanted a cleaner approach using an inbuilt PySpark functionality that can also help me create other date parts, like week number, quarters, etc. Any suggestions?
You will have to concatenate the string once to make the timestamp-type column, and then you can easily extract week, quarter, etc.
You can use this function (and edit it to create whatever other columns you need as well):
import pyspark.sql.functions as F

def spark_date_parsing(df, date_column, date_format):
    """
    Parses the date column given the date format in a spark dataframe
    NOTE: This is a Pyspark implementation

    Parameters
    ----------
    :param df: Spark dataframe having a date column
    :param date_column: Name of the date column
    :param date_format: Simple Date Format (Java-style) of the dates in the date column

    Returns
    -------
    :return: A spark dataframe with a parsed date column
    """
    df = df.withColumn(date_column, F.to_timestamp(F.col(date_column), date_format))

    # Spark returns 'null' if the parsing fails, so first check the count of null values
    # If parse_fail_count = 0, return parsed column else raise error
    parse_fail_count = df.select(
        F.count(F.when(F.col(date_column).isNull(), date_column))
    ).collect()[0][0]

    if parse_fail_count == 0:
        return df
    else:
        raise ValueError(
            f"Incorrect date format '{date_format}' for date column '{date_column}'"
        )
Usage (with whatever your resultant date format is):
df = spark_date_parsing(df, "date", "dd/MM/yyyy")
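Once the column is a real timestamp, the other date parts mentioned in the question come from built-in functions; a small sketch (assuming the parsed column is named "date" and pyspark.sql.functions is imported as F, as above):

df = (df.withColumn("week", F.weekofyear(F.col("date")))
        .withColumn("quarter", F.quarter(F.col("date")))
        .withColumn("year", F.year(F.col("date"))))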

Why is my formula returning a number instead of a date?

When trying to add days to a date in another column the value returns as a number, not a date.
I have tried to set the column format to date for both columns B and C. I also tried using the DATEVALUE() function, but I don't think I used it properly.
=ARRAYFORMULA(IF(ROW(B:B)=1,"Second Notification",IF(LEN(B:B), B:B+1,)))
I want the value in column C to return as a date.
Use this with the TEXT formula:
={"Second Notification";
ARRAYFORMULA(IF(LEN(B2:B), TEXT(B2:B+1, "MM/dd/yyyy hh:mm:ss"), ))}
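Note that TEXT() returns a string that only looks like a date. If column C needs to hold an actual date value (the stated goal), one variant, assuming column C is formatted as a date, is to drop the TEXT() wrapper:

={"Second Notification";
 ARRAYFORMULA(IF(LEN(B2:B), B2:B+1, ))}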

Difference between Date column and other date

I would like to find the difference between a Date column in a Spark dataset and a date value which is not a column.
If both were columns, I would do
datediff(col("dateStart"), col("dateEnd"))
As I want to find the difference between col("dateStart") and another date which is not a column, I did:
val dsWithRunDate = detailedRecordsDs.withColumn(RunDate,
  lit(runDate.toString))
val dsWithTimeDiff = dsWithRunDate.withColumn(DateDiff,
  datediff(to_date(col(RunDate)), col(DateCol)))
Is there a better way to do this instead of adding one more column and then finding the difference?
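One possible way to skip the intermediate column is to pass the run date as a literal directly inside datediff; a sketch in the same Scala style, reusing the names from the snippet above:

import org.apache.spark.sql.functions.{col, datediff, lit, to_date}

// pass the run date as a literal instead of materialising it as a column first
val dsWithTimeDiff = detailedRecordsDs.withColumn(DateDiff,
  datediff(to_date(lit(runDate.toString)), col(DateCol)))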

PBI Slicer Issue

I have the following problem applying DAX in Power BI:
DTAT column = Initial Disposition Date - Document Date
For those cells in the Initial Disposition Date column where I don't have data, the calculation will certainly return negative values, since all cells in the Document Date column have values.
Is it possible to create a slicer based on the DTAT column where I can omit the negative values on the left-hand side (but not delete them, since that would change/remove other relevant data in my report)? That is, not have them shown in the slicer, but still have them there to slide?
I've got something like this to calculate the DTAT in Power BI:
DTAT column = IF( Initial Disposition Date = BLANK(), BLANK(), Initial Disposition Date - Document Date )
For some reason, when I applied the function, there were still negative numbers in the slicer range.
Before applying it, the range was: -43,000 to 324
After applying it, the range was: -290 to 324
Does anyone have any ideas why the negative numbers still appear? (Even though there are no cells where Document Date > Initial Disposition Date, except for those that are initially null, which become blank when applying the function.)
Thank you so much! I'm new to PBI so any suggestions or ideas are highly appreciated!
I would say that the negative values are correct, because the Document Date can fall both before and after the Initial Disposition Date.
If you only want the difference between these 2 dates, I would apply the following formula:
DTAT column = ABS(IF( Initial Disposition Date = BLANK(), BLANK(), Initial Disposition Date - Document Date ))
If this is not the issue you mean, please tell me!
Following up on the comment on my first answer, you can add a column that returns the negative values as 0:
DTAT column(WithoutNegativeValues) = IF(DTAT column<0, 0, DTAT column)
Is this the result you want to obtain?