Spark Scala script to read data from S3 on a daily basis

There is a Java application that dumps data (as CSV files) to S3 on a daily basis. The application creates a folder in S3 named after the system date (in MM-DD-YYYY format) and then adds files to that folder.
Now I want to read those files from S3 on a daily basis, like this:
val fileFromS3 = sc.textFile("s3a://digital/MM-DD-YYYY/abc.csv")
The script should replace 'MM-DD-YYYY' with the system date.
Please suggest a possible solution or any other way to achieve this.

You can get the current time with the Calendar class.
First you need some imports:
import java.util.Calendar
import java.text.SimpleDateFormat
import java.util.Date
Then you can get the current time:
val now = Calendar.getInstance().getTime()
Then prepare a formatter that renders the date in the format you want; MM and dd ensure two-digit months and days:
val formatter = new SimpleDateFormat("MM-dd-yyyy")
Then use the formatter to get the date as a String:
val dateAsString = formatter.format(now)
Finally, load your resource with the dateAsString value using string interpolation:
val fileFromS3 = sc.textFile(s"s3a://digital/${dateAsString}/abc.csv")
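On Java 8 or later, the same path can also be built with java.time, which avoids the mutable Calendar and the non-thread-safe SimpleDateFormat. This is a minimal sketch, assuming the bucket and file names from the question; the object name DailyS3Path and the helper pathFor are hypothetical names introduced here for illustration:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object DailyS3Path {
  // MM and dd always produce two digits, matching the MM-DD-YYYY folder names.
  private val folderFormat = DateTimeFormatter.ofPattern("MM-dd-yyyy")

  // Hypothetical helper: builds the path for a given date
  // (defaults to today in the JVM's default time zone).
  def pathFor(date: LocalDate = LocalDate.now()): String =
    s"s3a://digital/${date.format(folderFormat)}/abc.csv"
}

// Usage, assuming an existing SparkContext named sc as in the question:
// val fileFromS3 = sc.textFile(DailyS3Path.pathFor())
```

If the Java application that writes the folders and the Spark driver run in different time zones, the folder date may differ around midnight; passing an explicit LocalDate to pathFor sidesteps that.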

Related

Google Sheets Script parseCsv - How to avoid converting text to numbers, data and formulas

I'm using the following code to import a csv file:
var csvData = Utilities.parseCsv(csvFile);
It works, but for some specific values, like "2.02", it converts the text to a date.
If I use the File > Import menu, I can avoid this automatic conversion, but I was not able to set that option as a parameter of the function above (parseCsv).
Is it possible to use parseCsv and avoid this automatic text-to-date conversion?
Thank you!

Get system time and convert to string

I am trying to get the system date and time and use it in the setFile() method to prevent overwriting my output files. Any idea how I can do that? I went down the path of Calendar.YEAR, etc., but that gives me the model date and time, not the system's. Any suggestions on how to proceed?

You can easily create a String of your current system time using:
Date currentDate = new Date(System.currentTimeMillis());
String dateAsString = currentDate.toString();
Use that in your setFile() method for the filename and you will never overwrite any outputs (unless you ran them in the same exact second :-) ).

After further investigation I was also able to implement this using a SimpleDateFormat, as follows:
Date date = Calendar.getInstance().getTime();
// HH is the 24-hour clock; hh (12-hour) would give the same name for e.g. 03:00 and 15:00
SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd HH.mm.ss");
String strDate = "myFile" + dateFormat.format(date) + ".txt";
myFile.setFile(strDate, myFile.WRITE);

change the timestamp to UTC format in spark using scala

This question is similar to: Change the timestamp to UTC format in Pyspark
Basically, I need to convert an ISO 8601 timestamp string with an offset to a UTC timestamp string (2017-08-01T14:30:00+05:30 -> 2017-08-01T09:00:00+00:00) using Scala.
I am fairly new to Scala/Java. I checked the Spark library, and it does not seem to offer a way to convert without knowing the time zone, which I don't know unless I parse it out in an ugly way or use a Java/Scala library. Can someone help?
UPDATE: A better way to do this: set the session time zone in Spark and cast the column to DataTypes.TimestampType to do the time zone shift.

org.apache.spark.sql.functions.to_utc_timestamp:
def to_utc_timestamp(ts: Column, tz: String): Column
Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in the given time zone, and renders that time as a timestamp in UTC. For example, 'GMT+1' would yield '2017-07-14 01:40:00.0'.

You can use the java.time primitives to parse and convert your timestamp.
scala> import java.time.{OffsetDateTime, ZoneOffset}
import java.time.{OffsetDateTime, ZoneOffset}
scala> val datetime = "2017-08-01T14:30:00+05:30"
datetime: String = 2017-08-01T14:30:00+05:30
scala> OffsetDateTime.parse(datetime).withOffsetSameInstant(ZoneOffset.UTC)
res44: java.time.OffsetDateTime = 2017-08-01T09:00Z
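The result above prints as 2017-08-01T09:00Z, whereas the question asks for the +00:00 form. Here is a minimal sketch of one way to produce that string, assuming a formatter pattern chosen here (lowercase xxx prints a zero offset as +00:00 rather than Z); the object name UtcConversion and the method toUtcString are hypothetical:

```scala
import java.time.{OffsetDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

object UtcConversion {
  // Lowercase "xxx" renders the offset as +HH:MM, including +00:00 for UTC.
  private val outputFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssxxx")

  // Parses an ISO 8601 timestamp with an offset and re-renders the same instant in UTC.
  def toUtcString(isoWithOffset: String): String =
    OffsetDateTime
      .parse(isoWithOffset)
      .withOffsetSameInstant(ZoneOffset.UTC)
      .format(outputFormat)
}
```

In Spark, a function like this could be wrapped in a UDF and applied to a string column, though the built-in session time zone approach mentioned in the question's update avoids the UDF overhead.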

Can Pandas (or Python) recognize today's date?

I was wondering if Pandas can somehow figure out what today's date is, allowing me to automate the naming of the html file I create when I use the "df.to_html" method.
Basically I'm trying to read a website using the method "pd.read_html" and then save the dataframe as an html file, daily. The name of the html file will be the day's date. (So today's is 9/28/2016, tomorrow's will be 9/29/2016, and so on.) I'm not particular about the format of the date, so Sept or 09, whichever is okay.
I'm trying to automate this as much as possible, and so far the best I've gotten is using ".format", which allows me some flexibility. But I don't know how I can further automate the process.
import pandas as pd
df = pd.read_html('random site')[0]  # read_html returns a list of DataFrames; take the first table
today_date = 'saved data/{}.html'.format('Sept 28') # I'm saving it in the folder "saved data" with the name as today's date.html.
df.to_html(today_date)
Thanks.

See the datetime module.
Specifically, datetime.date.today().isoformat() gives you a string with the current date in ISO 8601 format ('YYYY-MM-DD').

Trying to get the current date with a specific format but the date is coming out in an unexpected way

Here is the code and output. Please let me know what's wrong with the code.
import java.text.SimpleDateFormat
def myDate=new Date()
def sdf= new SimpleDateFormat("MM/DD/YYYY")
return sdf.format(myDate)
log.info sdf.format(myDate)
Output: 04/94/2016
Thanks!

You need lowercase dd and yyyy. In SimpleDateFormat patterns, DD is the day of the year (which is why you got 94) and YYYY is the week year.
You can also call format on dates directly in Groovy:
date.format('MM/dd/yyyy')
Also, if this is to be read by anyone outside the US, or you want to be able to sort dates alphabetically, consider the more universal (ISO 8601) format:
date.format('yyyy-MM-dd')
As AR.3 said in their answer, the pattern letters are described in the SimpleDateFormat documentation.
