If a column has dates in multiple formats, get the last day of the month for a specific date format - Scala

I have a Spark DataFrame with two columns (SEQ - Integer, MAIN_DATE - Date) as:
Now I want to add a column based on the following condition: if the format of MAIN_DATE is "MMM-YYYY", the value should be converted to the last day of the month. The new DataFrame should look like this:
Any suggestion will be much appreciated.

You can use Spark's when/otherwise methods to handle each date format in the MAIN_DATE column differently.
More specifically, you can match the MMM-yyyy values by the String length of the field (since we know those values will always have 8 characters) as the condition in when, and then:
use to_date to convert the String value to a valid date based on the format we pass as an argument, and
use last_day to get the last day of the month that each date in MAIN_DATE refers to.
As for the "regular" rows with the dd-MMM-yyyy date format, just a to_date conversion is sufficient within the otherwise method.
After that, all that's left to do is to convert the dates back to the desired dd-MMM-yyyy format with date_format (because to_date returns a date that is displayed in the yyyy-MM-dd format).
This is the solution in Scala (split into two withColumn calls to make it more readable, instead of a one-liner):
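import org.apache.spark.sql.functions._

// 8-character values are MMM-yyyy: parse them and move to the last day of the month
df.withColumn("END_DATE",
  when(length(col("MAIN_DATE")).equalTo(8), last_day(to_date(col("MAIN_DATE"), "MMM-yyyy")))
    .otherwise(to_date(col("MAIN_DATE"), "dd-MMM-yyyy")))
  // format the parsed date back to a dd-MMM-yyyy string
  .withColumn("END_DATE", date_format(col("END_DATE"), "dd-MMM-yyyy"))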
This is what the resulting df DataFrame will look like:
+---+-----------+-----------+
|SEQ|  MAIN_DATE|   END_DATE|
+---+-----------+-----------+
|  1|16-JAN-2020|16-Jan-2020|
|  2|   FEB-2017|28-Feb-2017|
+---+-----------+-----------+
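For readers who want to check the branching logic outside Spark, here is a minimal plain-Python sketch of the same idea. It is not the Spark API: the function name end_date is hypothetical, and it assumes an English (or C) locale so that %b matches month abbreviations such as JAN and Feb.

```python
import calendar
from datetime import datetime


def end_date(main_date):
    """Mimic the when/otherwise logic for a single MAIN_DATE string."""
    if len(main_date) == 8:
        # "MMM-yyyy" branch: parse the month (day defaults to 1),
        # then move to the last day of that month.
        first = datetime.strptime(main_date, "%b-%Y").date()
        last = calendar.monthrange(first.year, first.month)[1]
        parsed = first.replace(day=last)
    else:
        # "dd-MMM-yyyy" branch: a plain parse is enough.
        parsed = datetime.strptime(main_date, "%d-%b-%Y").date()
    # Format back to dd-MMM-yyyy, as date_format does in the Spark version.
    return parsed.strftime("%d-%b-%Y")


print(end_date("16-JAN-2020"))  # 16-Jan-2020
print(end_date("FEB-2017"))     # 28-Feb-2017
```

The length check is only as reliable as the data: any other 8-character value would also take the first branch, so it may be worth validating the column before relying on it.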

Related

Converting string to date, datetime or Int in Mapping dataflow

I have a parquet file with a start_date and end_date columns
Formatted like this
01-Jan-2021
I've tried every combination of the toDate, toString, and toInteger conversion functions, but I still get nulls returned when viewing the data (see image).
I would like to see the result in two ways: YYYYMMDD as an integer column and YYYY-MM-DD as a date column.
eg 01012021 and 01-01-2021
I'm sure the default format has caused this issue.
Thanks
First, for the date formatter, you need to tell ADF what each part of your string represents. Use dd-MMM-yyyy for your format. Then, use a string formatter to manipulate the output as such: toString(toDate('01-Jan-2021', 'dd-MMM-yyyy'), 'yyyy-MM-dd')
For the integer representation: toInteger(toString(toDate('01-Jan-2021', 'dd-MMM-yyyy'), 'yyyyMMdd'))
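The parse-then-format chain above can be sketched in plain Python to show what each step produces. This is an illustration only, not ADF expression syntax: the function name to_iso_and_int is hypothetical, and an English locale is assumed for the month abbreviation.

```python
from datetime import datetime


def to_iso_and_int(text):
    """Parse a dd-MMM-yyyy string, then render it two ways."""
    # Step 1: tell the parser what each part of the string means.
    parsed = datetime.strptime(text, "%d-%b-%Y").date()
    # Step 2: format as yyyy-MM-dd text, and as a yyyyMMdd integer.
    as_text = parsed.strftime("%Y-%m-%d")
    as_int = int(parsed.strftime("%Y%m%d"))
    return as_text, as_int


print(to_iso_and_int("01-Jan-2021"))  # ('2021-01-01', 20210101)
```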
Ah, you say *"I would like to have see the result in two ways YYYYMMDD as integer column and YYYY-MM-DD as Date columns. eg 01012021 and 01-01-2021"* Do you want YYYYMMDD or dd-MM-yyyy? Your example is in the latter format.
Anyway, here are the expressions you could use:
My source:
Use derived column:
Edit expression:
start_date_toInteger : toString(toDate(substring(start_date,1,11), 'dd-MMM-yyyy'), 'yyyyMMdd')
start_date_toDate: toString(toDate(substring(start_date,1,11), 'dd-MMM-yyyy'), 'yyyy-MM-dd')
Final results:

How would I convert spark scala dataframe column to datetime?

Say I have a dataframe with two columns, both of which need to be converted to datetime format. However, the current formatting of the columns varies from row to row, and when I apply the to_date method, I get all nulls returned.
Here's a screenshot of the format....
the code I tried is...
date_subset.select(col("InsertDate"),to_date(col("InsertDate")).as("to_date")).show()
which returned
Your datetime is not in the default format, so you should give the format.
to_date(col("InsertDate"), "MM/dd/yyyy HH:mm")
I can't tell which part is the month and which is the day, but you can adjust the format in this way.
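The effect of a missing format can be reproduced in plain Python. In this sketch, parse_or_none is a hypothetical helper that returns None on a mismatch, loosely mirroring the nulls described in the question. The sample value is made up, since the screenshot is not available.

```python
from datetime import datetime


def parse_or_none(text, fmt):
    """Return a date if text matches fmt, otherwise None."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


sample = "03/15/2019 14:05"  # hypothetical value in MM/dd/yyyy HH:mm form

# An ISO-style default pattern does not match this string.
print(parse_or_none(sample, "%Y-%m-%d"))        # None
# Supplying the matching pattern succeeds.
print(parse_or_none(sample, "%m/%d/%Y %H:%M"))  # 2019-03-15
```

Values where both fields are 12 or less (for example 03/04/2019) parse under either month/day order, so the correct order has to come from knowledge of the data rather than from the parser.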

compare extracted date with today() in excel

Column 1 : I have this date-time format in one column = 2018-10-08T04:30:23Z
Column 3 : I extracted the date with the formula =LEFT(A11,10) and changed the column format to date.
Column 32 : today(). Just to make sure both date columns match
Now when I want to compare both dates
Column 4 : =IF(C11=D11,TRUE(),FALSE())
It does not work. What did I do wrong?
One option using formulas only would be to use Excel's DATE function, which takes three parameters:
=DATE(YEAR, MONTH, DAY)
Use the following formula to extract a date from your timestamp:
=DATE(LEFT(A1,4), MID(A1,6,2), MID(A1,9,2))
This assumes that the timestamp is in cell A1, with the format in your question. Now, comparing this date value against TODAY() should work, if the original timestamp were also from today.
Should be worth trying:
=1*LEFT(A1,10)=TODAY()
May depend upon your configuration. Without format conversion (the 1*) you are trying to compare text (all string functions return Text) with a Number.
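The text-versus-date mismatch behind both answers can be shown with a short Python sketch. It is an analogy rather than Excel behaviour: slicing a string yields text, which never equals a date value until it is converted. The helper name extract_date is hypothetical.

```python
from datetime import date


def extract_date(timestamp):
    """Take the first 10 characters (yyyy-MM-dd) and convert them to a date."""
    return date.fromisoformat(timestamp[:10])


stamp = "2018-10-08T04:30:23Z"
target = date(2018, 10, 8)

# The raw slice is still text, so the comparison is False.
print(stamp[:10] == target)           # False
# After conversion, the comparison works as expected.
print(extract_date(stamp) == target)  # True
```

To compare against the current day, replace target with date.today().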

Convert date variable to numeric

Feels like an obvious question, but Stata help hasn't yielded answers. Most Stata users are interested in converting a non-date variable into a date variable, but I want the opposite.
I have a date variable date, type long, format %tdCCYYNN. I'm trying to append it to a dataset in which the same variable date is type long and format %12.0g. To accurately do this, I need to convert date in the first dataset from %tdCCYYNN to %12.0g. When I do format %12.0g date, date values change to incorrect ones.
Let's say, in the first dataset, I have date=201204. I still want it to read 201204, just as a %12.0g variable. Is there a way to do this?
I +1 all the comments above by Nick and William and suggest you read help datetime. I have been using Stata for a few years and still frequently visit this help file. Stata's date/time functionality is fantastic and you will benefit from learning it earlier rather than later.
I would convert the other data to Stata date format. Really. But if you need to convert your %td date to an "integer YYYYNN" date, then pass through a temporary file. If you write your %td date to plain text, then it will keep the displayed format and you can read it back as an integer YYYYNN date.
// data that matches your description
clear
set obs 1
generate date = date("20120401", "YMD")
format date %tdCCYYNN
list
// write to tempfile as plain text
tempfile plainText
outsheet using "`plainText'"
// read back with dates as integers
preserve
tempfile StataData
insheet using "`plainText'", clear
rename date dateInteger
save "`StataData'"
restore
// merge to original data
merge 1:1 _n using "`StataData'"
list
describe
This yields the following.
. list

     +---------------------------------+
     |   date   dateIn~r        _merge |
     |---------------------------------|
  1. | 201204     201204   matched (3) |
     +---------------------------------+

. describe

Contains data
  obs:             1
 vars:             3
 size:             7
-----------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------------------------------------------
date            int     %tdCCYYNN
dateInteger     long    %12.0g
_merge          byte    %23.0g     _merge
-----------------------------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.
But I suggest you take advantage of Stata's date/time functionality.
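It may help to see why changing only the display format produces "incorrect" values. A %td variable stores the number of days since 01jan1960, and the format only controls how that number is shown. The Python sketch below illustrates the arithmetic; the names STATA_EPOCH and stata_td_to_yyyymm are hypothetical, and this is not Stata code.

```python
from datetime import date, timedelta

STATA_EPOCH = date(1960, 1, 1)  # day 0 for %td values


def stata_td_to_yyyymm(days):
    """Convert a %td day count to an integer of the form YYYYMM."""
    d = STATA_EPOCH + timedelta(days=days)
    return d.year * 100 + d.month


# The stored value behind 01apr2012 is a day count, not 201204.
raw = (date(2012, 4, 1) - STATA_EPOCH).days
print(raw)                      # 19084
print(stata_td_to_yyyymm(raw))  # 201204
```

The same year-times-100-plus-month arithmetic can likely be written directly in Stata with its year() and month() functions, which would avoid the temporary file.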

Date Column Split in Talend

So I have one big file (13 million rows) and date formatted as:
2009-04-08T01:57:47Z. Now I would like to split it into 2 columns now,
one with just date as dd-MM-yyyy and other with time only hh:MM.
How do I do it?
You can simply use tMap and parseDate/formatDate to do what you want. It is neither necessary nor recommended to implement your own date parsing logic with regexes.
First of all, parse the timestamp using the format yyyy-MM-dd'T'HH:mm:ss'Z'. Then you can use the parsed Date to output the formatted date and time information you want:
dd-MM-yyyy for the date
HH:mm for the time (Note: you mixed up the case in your question, MM stands for the month)
If you put that logic into a tMap:
you will get the following:
Input:
timestamp 2009-04-08T01:57:47Z
Output:
date 08-04-2009
time 01:57
NOTE
Note that when you parse the timestamp with the mentioned format string (yyyy-MM-dd'T'HH:mm:ss'Z'), the time zone information is not parsed (the 'Z' is treated as a literal). Many applications do not properly set the time zone information anyway and always use 'Z' instead, so this can be safely ignored in most cases.
If you need proper time zone handling and by any chance are able to use Java 7, you may use yyyy-MM-dd'T'HH:mm:ssXXX instead to parse your timestamp.
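The parse-once, format-twice approach can be sketched in plain Python. This is not Talend code: split_timestamp is a hypothetical name, and the trailing Z is matched as a literal character, just as in the first format string above.

```python
from datetime import datetime


def split_timestamp(ts):
    """Parse an ISO-style timestamp, then format date and time separately."""
    # The trailing Z is a literal here, so no time zone is interpreted.
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.strftime("%d-%m-%Y"), parsed.strftime("%H:%M")


print(split_timestamp("2009-04-08T01:57:47Z"))  # ('08-04-2009', '01:57')
```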
I'm guessing Talend is falling over on the T and Z part of your date time stamp but this is easily resolved.
As your date time stamp is in a regular pattern we can easily extract the date and time from it with a tExtractRegexFields component.
You'll want to use "^([0-9]{4}-[0-9]{2}-[0-9]{2})T([0-9]{2}:[0-9]{2}):[0-9]{2}Z" as your regex, which will capture the date in yyyy-MM-dd format and the time as HH:mm (you'll want to replace the date time field with a date field and a time field in the schema).
Then to format your date to your required format you'll want to use a tMap and use TalendDate.formatDate("dd-MM-yyyy",TalendDate.parseDate("yyyy-MM-dd",row7.date)) to return a string in the dd-MM-yyyy format.
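To check what the two capture groups of that regex contain, the same pattern can be tried in Python. This is a sketch for illustration only: extract_fields is a hypothetical helper, and the date reordering is done with string splitting instead of TalendDate.

```python
import re

# Same pattern as in the answer: group 1 is the date, group 2 is HH:mm.
PATTERN = re.compile(
    r"^([0-9]{4}-[0-9]{2}-[0-9]{2})T([0-9]{2}:[0-9]{2}):[0-9]{2}Z"
)


def extract_fields(ts):
    """Return (dd-MM-yyyy date, HH:mm time), or None if ts does not match."""
    match = PATTERN.match(ts)
    if match is None:
        return None
    year, month, day = match.group(1).split("-")
    return f"{day}-{month}-{year}", match.group(2)


print(extract_fields("2009-04-08T01:57:47Z"))  # ('08-04-2009', '01:57')
```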