Read JSON data from Blob where the files are stored inside date folders that auto-increment every day - scala

The HDFS blob stores JSON data in the format below on a daily basis. I need to read the JSON data with spark.read.json() day-wise. For example, today I want to read day=01's files and tomorrow I want to read day=02's files. Is there logic I can write in Scala that auto-increments the date, taking month and year into account as well? Any help would be much appreciated.
/signals/year=2019/month=08/day=01
/signals/year=2019/month=08/day=01/*****.json
/signals/year=2019/month=08/day=01/*****.json
/signals/year=2019/month=08/day=02
/signals/year=2019/month=08/day=02/*****_.json
/signals/year=2019/month=08/day=02/*****_.json

It looks like the data is stored in a partitioned format, and to read only one date a function like this can be used:
import org.apache.spark.sql.DataFrame
import spark.implicits._ // for the $"..." column syntax

def readForDate(year: Int, month: Int, day: Int): DataFrame = {
  spark.read.json("/signals")
    .where($"year" === year && $"month" === month && $"day" === day)
}
To use this function, take the current date and split it into its parts with regular Scala code, not related to Spark.
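For example, a minimal sketch of that glue code (assuming the readForDate function above and a SparkSession named spark are in scope):
import java.time.LocalDate

// Take today's date apart and pass the pieces to readForDate
val today = LocalDate.now()
val todayDF = readForDate(today.getYear, today.getMonthValue, today.getDayOfMonth)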

If there is any relation between the current date and the date of the JSON files you want to process, you can get the current date (and add or subtract any number of days) using the Scala code below and use it in your Spark application, as #pasha701 suggested.
scala> import java.time.format.DateTimeFormatter
scala> import java.time.LocalDateTime
scala> val dtf = DateTimeFormatter.ofPattern("dd") // day pattern; use "yyyy" and "MM" the same way for year and month
scala> val now = LocalDateTime.now()
scala> println(dtf.format(now))
02
scala> println(dtf.format(now.plusDays(2))) // add two days to the current date
04
Just a thought: if you are using Azure Databricks, you can also run a shell command in a notebook cell to get the current day (again, only if there is a relation between the partition files you are trying to fetch and the current date) using the "%sh" magic command.

Hope this helps someone in the future. The code below reads the data available in the blob where the files are stored inside date folders that auto-increment every day. I wanted to read the previous day's data, so I added now.minusDays(1).
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

val dtf = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val now = LocalDateTime.now()
val date = dtf.format(now.minusDays(1)) // previous day's date
val currentDateHold = date.split("-").toList
val year = currentDateHold(0)
val month = currentDateHold(1)
val day = currentDateHold(2)
val path = "/signals/year=" + year + "/month=" + month + "/day=" + day
// Read the JSON data from the Azure Blob
val initialDF = spark.read.format("json").load(path)
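The same partition path can also be built with a single formatter pattern instead of splitting the string; a small sketch (the "/signals" prefix comes from the question, the variable names are just illustrative):
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Literal text inside single quotes is emitted as-is by DateTimeFormatter
val partitionFmt = DateTimeFormatter.ofPattern("'/signals/year='yyyy'/month='MM'/day='dd")
val prevDayPath = partitionFmt.format(LocalDate.now().minusDays(1))
val prevDayDF = spark.read.json(prevDayPath)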

Related

Kotlin convert string to date and take one day before

I have a problem. I have a date as a String, e.g. "2021-05-06", and now I need to take one day before it (2021-05-05). Here is how I am making the date from the String, but I cannot take one day before. Any tips?
val date = SimpleDateFormat("dd-MM-yyyy").parse(currentDate)
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val date = LocalDate.parse("2021-05-06", formatter).minusDays(1)
println(date)
Output:
2021-05-05
If working with LocalDate is fine, you could do
var date = LocalDate.parse("2021-05-06")
date = date.minusDays(1)
By analogy with similar questions in Java (there it was addition, but we can perform subtraction), we can get the following piece of code:
val date = LocalDate.parse(currentDate)
val newDate = date.minusDays(1)
First similar question
Second similar question

How to convert Date column format in case class of Scala?

I am using Scala Spark. I have two similar CSV files with 10 columns. One difference is the format of the Date column.
1st file Date format: yyyy-MM-dd
2nd file Date format: dd-MM-yyyy
The objective is to create a separate schema RDD for each file and finally merge both RDDs.
For the first case class, I used Date.valueOf [java.sql.Date] in the case class mapping. No issues there.
I am finding an issue with the 2nd file's Date format.
I used the same Date.valueOf mapping, but it throws an error about the date format.
How can I map the date format in the second file to the 1st format, yyyy-MM-dd? Please assist.
Use java.util.Date:
import java.text.SimpleDateFormat

val sDate1 = "31/12/1998"
val date1 = new SimpleDateFormat("dd/MM/yyyy").parse(sDate1)
Result:
sDate1: String = 31/12/1998
date1: java.util.Date = Thu Dec 31 00:00:00 CET 1998
To change the output to a common string format:
val date2 = new SimpleDateFormat("yyyy/MM/dd")
date2.format(date1)
Result:
res1: String = 1998/12/31
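Applied to the question's second file, a minimal sketch could parse the dd-MM-yyyy strings and hand back java.sql.Date values for the case-class mapping (the helper name and the column index are assumptions, not from the original post):
import java.sql.Date
import java.text.SimpleDateFormat

// Hypothetical helper: parse the 2nd file's "dd-MM-yyyy" strings into java.sql.Date,
// so both RDDs carry the same date type as the 1st file
def parseDdMmYyyy(s: String): Date = {
  val inFmt = new SimpleDateFormat("dd-MM-yyyy")
  new Date(inFmt.parse(s).getTime) // java.sql.Date wraps the epoch millis
}

// Usage in the 2nd file's mapping: parseDdMmYyyy(cols(3)) instead of Date.valueOf(cols(3))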

Spark scala: obtaining weekday from utcstamp (function works for specific date, not for entire column)

I have a Scala / Spark dataframe with one column named "utcstamp" whose values have the following format: 2018-12-12 21:15:00
I want to obtain a new column with the weekday and, inspired by this question in the forum, used the following code:
import java.util.Calendar
import java.text.SimpleDateFormat
val dowText = new SimpleDateFormat("E")
df = df.withColumn("weekday" , dowText.format(df.select(col("utcstamp"))))
However, I get the following error:
<console>:58: error: type mismatch;
found : String
required: org.apache.spark.sql.Column
When I try this applied to a specific date (like in the link provided) it works; I just can't apply it to the whole column.
Can anyone help me with this? If you have an alternative way of converting a UTC column into the weekday, that will also do for me.
You can use dayofweek function of Spark SQL, which gives you a number from 1-7, for Sunday to Saturday:
val df2 = df.withColumn("weekday", dayofweek(col("utcstamp").cast("timestamp")))
Or if you want words (Sun-Sat) instead,
val df2 = df.withColumn("weekday", date_format(col("utcstamp").cast("timestamp"), "EEE"))
You can simply get the day of week with date_format using the pattern "E" or "EEEE" (e.g. "Sun" or "Sunday"):
df.withColumn("weekday", date_format(to_timestamp($"utcstamp"), "E"))
If you want the day of week as a numeric value, use the dayofweek function, which is available from Spark 2.3+.
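A quick usage sketch with the sample value from the question (the column name and timestamp come from the question; the rest is illustrative only):
import org.apache.spark.sql.functions.{col, date_format, dayofweek, to_timestamp}
import spark.implicits._

// One-row DataFrame with the question's sample timestamp string
val sample = Seq("2018-12-12 21:15:00").toDF("utcstamp")
val withDay = sample
  .withColumn("weekday_num", dayofweek(to_timestamp(col("utcstamp")))) // 4, since 1 = Sunday
  .withColumn("weekday", date_format(to_timestamp(col("utcstamp")), "E")) // "Wed"
withDay.show()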

Pyspark - Avg of Days by year and month

I have a CSV file stored in hdfs with the following format:
Business Line,Requisition (Job Title),Year,Month,Actual (# of Days)
Communications,1012_Com_Specialist,2017,February,150
Information Technology,5781_Programmer_Associate,2017,March,80
Information Technology,2497_Programmer_Senior,2017,March,120
Services,6871_Business_Analyst_Jr,2018,May,33
I would like to get the average of Actual (# of Days) by Year and Month. Could someone please help me with how to do this using PySpark and save the output to a Parquet file?
You can convert the CSV to a DataFrame and run Spark SQL as below:
csvRDD.map(rec => {
  val i = rec.split(',')
  (i(0).toString, i(1).toString, i(2).toString, i(3).toString, i(4).toInt)
}).toDF("businessline", "jobtitle", "year", "month", "actual").registerTempTable("input")

val resDF = sqlContext.sql("Select year, month, avg(actual) as avgactual from input group by year, month")
resDF.write.parquet("/user/path/solution1")
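The same aggregation can also be written with the DataFrame API; a sketch in Scala (the input path and the header/inferSchema options are assumptions):
import org.apache.spark.sql.functions.avg

val inputDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/user/path/input.csv") // hypothetical HDFS path
  .toDF("businessline", "jobtitle", "year", "month", "actual")

val avgDF = inputDF.groupBy("year", "month").agg(avg("actual").alias("avgactual"))
avgDF.write.parquet("/user/path/solution1")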

Always get "1970" when extracting a year from timestamp

I have a timestamp like "1461819600". Then I execute this code in a distributed environment as val campaign_startdate_year: String = Utils.getYear(campaign_startdate_timestamp).toString
The problem is that I always get the same year, 1970. What might be the reason for it?
import com.github.nscala_time.time.Imports._

def getYear(timestamp: Any): Int = {
  var dt = 2017
  if (!timestamp.toString.isEmpty) {
    dt = new DateTime(timestamp.toString.toLong).getYear // toLong should be multiplied by 1000 to get millisecond value
  }
  dt
}
The same issue occurs when I want to get the day of the month: I get 17 instead of 28.
def getDay(timestamp: Any): Int = {
  var dt = 1
  if (!timestamp.toString.isEmpty) {
    dt = new DateTime(timestamp.toString.toLong).getDayOfYear
  }
  dt
}
The timestamp you have is a number of seconds since 01-01-1970, 00:00:00 UTC.
Java (and Scala) usually use timestamps that are a number of milliseconds since 01-01-1970, 00:00:00 UTC.
In other words, you need to multiply the number by 1000.
The timestamp that you have seems to be in seconds since the epoch (i.e. a Unix timestamp). Java time utilities expect the timestamp to be in milliseconds.
Just multiply that value by 1000 and you should get the expected results.
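A minimal sketch of that fix with java.time (the sample value is the one from the question; the printed values assume UTC):
import java.time.{Instant, ZoneOffset}

// The stored value is seconds since the epoch, so scale it to milliseconds first
val timestampSeconds = 1461819600L
val dateTime = Instant.ofEpochMilli(timestampSeconds * 1000L).atZone(ZoneOffset.UTC)
println(dateTime.getYear)       // 2016, not 1970
println(dateTime.getDayOfMonth) // 28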
You can rely either on Spark SQL functions, which have some date utilities (get year/month/day, add day/month), or you can use the Joda-Time library to have more control over Date and DateTime, as in my answer here: How to replace in values in spark dataframes after recalculations?