read files from current date minus 90 days in spark - scala

I am reading files one by one from a directory structure such as YY=18/MM=12/DD=10, and I need to read only the files from the last 90 days (current date minus 90 days). A file is normally created every day, but on some days no file is created, and then the folder for that day does not exist either.
I wrote the code below, but it is not working.
var datecalculate = {
var days = 0
do{
val start = DateTime.now
var start1 = DateTime.now.minusDays(days)
days = days + 1
var start2 = start1.toString
datecalculatenow(start2) }
while (days <= 90)
}
def datecalculatenow(start2:String):String={
var YY:String = start2.toString.substring(0,4)
var MM:String = start2.toString.substring(5,7)
var DD:String = start2.toString.substring(8,10)
var datepath = "YYYY=" + YY +"/MM=" +MM +"/DD=" +DD
var datepath1 = datepath.toString
org.apache.spark.sql.SparkSession.read.option("delimiter","|").
option("header","true").option("inferSchema","true").
csv("/Table/Files" + datepath1 )
}
I expect to read every file from the last 90 days (current date minus 90 days) from the YY/MM/DD directory structure.

With Spark SQL you can use the following in a select statement to subtract 90 days:
date_sub(CAST(current_timestamp() as DATE), 90)

Since it is possible to create a DataFrame from a list of paths, why not generate the list of paths first? Here is a simple and concise way to read data from multiple paths:
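import org.joda.time.DateTime // assuming Joda-Time, as DateTime.now in the question suggests

// Build one path per day for the last 90 days (today included).
val paths = (0 until 90).map { days =>
  val tmpDate = DateTime.now.minusDays(days).toString() // ISO-8601 string, starts with yyyy-MM-dd
  val year = tmpDate.substring(0, 4)
  val month = tmpDate.substring(5, 7)
  val opdate = tmpDate.substring(8, 10)
  s"basepath/YY=$year/MM=$month/DD=$opdate"
}.toList

val df = spark.read
  .option("delimiter", "|")
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(paths: _*)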
While generating the paths, you can filter out those that do not exist. I've reused some of your code with some modifications. I haven't tested it on my local setup, but the idea is the same. Hopefully it will help you.

Related

Kotlin: Getting the difference between two dates (now and previous date)

Sorry if similar questions have been asked too many times, but it seems that there's one or more issues with every answer I find.
I have a date in the form of a String: Ex.: "04112005"
This is a date. 4th of November, 2005.
I want to get the difference, in years and days, between the current date and this date.
The code I have so far gets the years and just subtracts them:
fun getAlderFraFodselsdato(bDate: String): String {
val bYr: Int = getBirthYearFromBirthDate(bDate)
var cYr: Int = Integer.parseInt(SimpleDateFormat("yyyy").format(Date()))
return (cYr-bYr).toString()
}
However, naturally, this is quite inaccurate, since the month and days aren't included.
I've tried several approaches to create Date, LocalDate, SimpleDate etc. objects and use these to calculate the difference. But for some reason I haven't gotten any of them to work.
I need to create a Date (or similar) object of the current year, month and day. Then I need to create the same object from a string containing, say, day, month and year ("04112005"). Then I need to get the difference between these, in years, months and days.
All hints are appreciated.
I would use java.time.LocalDate both for parsing the string and for getting today's date, along with a java.time.Period, which calculates the period between two LocalDates for you.
See this example:
import java.time.LocalDate
import java.time.Period
import java.time.format.DateTimeFormatter

fun main(args: Array<String>) {
    // parse the date with a suitable formatter
    val from = LocalDate.parse("04112005", DateTimeFormatter.ofPattern("ddMMyyyy"))
    // get today's date
    val today = LocalDate.now()
    // calculate the period between those two
    val period = Period.between(from, today)
    // and print it in a human-readable way
    println("The difference between " + from.format(DateTimeFormatter.ISO_LOCAL_DATE)
            + " and " + today.format(DateTimeFormatter.ISO_LOCAL_DATE) + " is "
            + period.years + " years, " + period.months + " months and "
            + period.days + " days")
}
The output for a today of 2020-02-21 is
The difference between 2005-11-04 and 2020-02-21 is 14 years, 3 months and 17 days
This works below API level 26.
There are many date formats; you just pass in the format, the start date and the end date, and it will show you the result. The code uses Joda-Time's DateTimeFormat and Period classes. You can look up other date format patterns if you need them.
tvDifferenceDateResult.text = getDateDifference(
"12 November, 2008",
"31 August, 2021",
"dd MMMM, yyyy")
General method to calculate date difference
import org.joda.time.DateTime
import org.joda.time.Period
import org.joda.time.format.DateTimeFormat
import org.joda.time.format.DateTimeFormatter

fun getDateDifference(fromDate: String, toDate: String, formater: String): String {
    val fmt: DateTimeFormatter = DateTimeFormat.forPattern(formater)
    val mDate1: DateTime = fmt.parseDateTime(fromDate)
    val mDate2: DateTime = fmt.parseDateTime(toDate)
    val period = Period(mDate1, mDate2)
    // The period is split into years, months, weeks and days,
    // so period.days is always between 0 and 6.
    // To get days instead of weeks, multiply the weeks by 7 and add them;
    // the extra 1 makes the count include the end date.
    val mDays: Int = period.days + (period.weeks * 7) + 1
    return "Year: ${period.years}\nMonth: ${period.months}\nDay: $mDays"
}
For API levels below 26, where java.time.* is not available unless you enable desugaring (Android Gradle plugin 4.0 or later), you can use the legacy Date functions:
import java.text.SimpleDateFormat
import java.util.Locale

fun getLegacyDateDifference(fromDate: String, toDate: String, formatter: String = "yyyy-MM-dd HH:mm:ss", locale: Locale = Locale.getDefault()): Map<String, Long> {
    val fmt = SimpleDateFormat(formatter, locale)
    val bgn = fmt.parse(fromDate)
    val end = fmt.parse(toDate)
    // Each value below is the whole difference expressed in that unit.
    val milliseconds = end.time - bgn.time
    val days = milliseconds / 1000 / 3600 / 24
    val hours = milliseconds / 1000 / 3600
    val minutes = milliseconds / 1000 / 60
    val seconds = milliseconds / 1000
    val weeks = days.div(7)
    return mapOf("days" to days, "hours" to hours, "minutes" to minutes, "seconds" to seconds, "weeks" to weeks)
}
The answers above that use the java.time.* API are much cleaner and more accurate, though.

Moment - Difference between two EPOCH dates

I have two dates in EPOCH value.
Open : 1579269496000
Close : 1579270005225
I want to display the difference between the two dates.
So difference = Close - Open = <ddd> Days <hh> Hours <mm> Mins <ss> Sec.
I'm using Moment.js to convert the dates, but I don't see how to subtract epoch dates with it.
var c = new Date(close);
var o = new Date(open);
var seconds = (c.getTime() - o.getTime()) / 1000;
var ms = moment(close,"DD/MM/YYYY HH:mm:ss").diff(moment(open,"DD/MM/YYYY HH:mm:ss"));
Hope this solves your concern.
First convert the epoch values to Moment objects. Then get the difference in days and add it to the start date, get the remaining difference in hours and add that as well, and so on down to seconds. Note that moment.unix() expects seconds; for millisecond values like the ones in the question, use moment(value) instead.
var moment = require("moment");
var a = 1582113418; // epoch seconds
var b = 1582113444;
var aa = moment.unix(a); // converted value
var bb = moment.unix(b); // converted value
var days = bb.diff(aa, "days");
aa.add(days, "days");
var hours = bb.diff(aa, "hours");
aa.add(hours, "hours");
var minutes = bb.diff(aa, "minutes");
aa.add(minutes, "minutes");
var seconds = bb.diff(aa, "seconds");
console.log(days + " days " + hours + " hours " + minutes + " minutes " + seconds + " seconds");
Working sandbox: https://codesandbox.io/s/adoring-shamir-0yuxq
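For values that are already epoch milliseconds, like the Open and Close values in the question, the same breakdown can also be sketched without Moment by peeling whole days, hours, minutes and seconds off the difference. The function name formatEpochDiff is made up for this illustration, and the sketch assumes the close value is not earlier than the open value.

```javascript
// Break the difference between two epoch-millisecond values into
// whole days, hours, minutes and seconds (hypothetical helper).
function formatEpochDiff(openMs, closeMs) {
  var totalSeconds = Math.floor((closeMs - openMs) / 1000);
  var days = Math.floor(totalSeconds / 86400);            // 86400 seconds per day
  var hours = Math.floor((totalSeconds % 86400) / 3600);  // what is left after whole days
  var minutes = Math.floor((totalSeconds % 3600) / 60);   // what is left after whole hours
  var seconds = totalSeconds % 60;                        // what is left after whole minutes
  return days + " Days " + hours + " Hours " + minutes + " Mins " + seconds + " Sec";
}

// Values from the question: the difference is 509225 ms, i.e. 509 seconds.
console.log(formatEpochDiff(1579269496000, 1579270005225)); // 0 Days 0 Hours 8 Mins 29 Sec
```

Using the remainder operator at each step keeps every unit within its natural range, so hours never exceed 23 and minutes and seconds never exceed 59.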

boto 3 - losing date format

I'm trying to read a parquet file using boto3. The original file has dates with the following format:
2016-12-07 23:00:00.000
And they are stored as timestamps.
My code in SageMaker is:
boto_s3 = boto3.client('s3')
r = boto_s3.select_object_content(
Bucket='bucket_name',
Key='path/file.gz.parquet',
ExpressionType='SQL',
Expression=f"select fecha_instalacion,pais from s3object s ",
InputSerialization = {'Parquet': {}},
OutputSerialization = {'CSV': {}},
)
rl0 = list(r['Payload'])[0]
from io import StringIO
string_csv = rl0['Records']['Payload'].decode('ISO-8859-1')
csv = StringIO(string_csv)
pd.read_csv(csv, names=['fecha_instalacion', 'pais'])
But instead of the date I get:
fecha_instalacion pais
45352962065516692798029824 ESPAƃA
I looked for dates with only one day in between, and only the first 6 digits are the same. As an example:
45337153205849123712294912--> 2016-12-09 23:00:00.000
45337116312360976293191680--> 2016-12-07 23:00:00.000
I need to get the correctly formatted date and avoid the special characters.
Thanks.
The problem is the format. That Parquet file uses INT96 numbers to represent timestamps.
Here is a function to convert the INT96 timestamp to a Python datetime:
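import datetime

def dateFromInt96Timestamp(int96Timestamp):
    # Upper 32 bits: Julian day number; lower 64 bits: nanoseconds within that day.
    julianCalendarDays = int96Timestamp >> 8*8
    time = (int96Timestamp & 0xFFFFFFFFFFFFFFFF) // 1_000  # nanoseconds -> microseconds
    linuxEpoch = 2_440_588  # Julian day number of 1970-01-01
    return datetime.datetime(1970, 1, 1) + datetime.timedelta(days=julianCalendarDays - linuxEpoch, microseconds=time)

# With a value from the question:
# dateFromInt96Timestamp(45337116312360976293191680) -> datetime.datetime(2016, 12, 7, 23, 0)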

Optimizing Spark/Scala speed

I have a Spark script that establishes a connection to Hive, reads data from different databases, and then writes the union into a CSV file. I tested it with two databases and it took 20 minutes. Now I am trying it with 11 databases and it has been running since yesterday evening (18 hours!). The script is supposed to get between 400000 and 800000 rows per database.
My question is: is 18 hours normal for such jobs? If not, how can I optimize it? This is what my main does:
// This is a list of the ten first databases used:
var use_database_sigma = List( Parametre_vigiliste.sourceDbSigmaGca, Parametre_vigiliste.sourceDbSigmaGcm
,Parametre_vigiliste.sourceDbSigmaGge, Parametre_vigiliste.sourceDbSigmaGne
,Parametre_vigiliste.sourceDbSigmaGoc, Parametre_vigiliste.sourceDbSigmaGoi
,Parametre_vigiliste.sourceDbSigmaGra, Parametre_vigiliste.sourceDbSigmaGsu
,Parametre_vigiliste.sourceDbSigmaPvl, Parametre_vigiliste.sourceDbSigmaLbr)
val grc = Tables.getGRC(spark) // This creates the first dataframe
var sigma = Tables.getSIGMA(spark, use_database_sigma(0)) // This creates other dataframe which is the union of ten dataframes (one database each)
for(i <- 1 until use_database_sigma.length)
{
if (use_database_sigma(i) != "")
{
sigma = sigma.union(Tables.getSIGMA(spark, use_database_sigma(i)))
}
}
// writing into csv file
val grc_sigma=sigma.union(grc) // union of the 2 dataframes
grc_sigma.cache
LogDev.ecrireligne("total : " + grc_sigma.count())
grc_sigma.repartition(1).write.mode(SaveMode.Overwrite).format("csv").option("header", true).option("delimiter", "|").save(Parametre_vigiliste.cible)
val conf = new Configuration()
val fs = FileSystem.get(conf)
val file = fs.globStatus(new Path(Parametre_vigiliste.cible + "/part*"))(0).getPath().getName();
fs.rename(new Path(Parametre_vigiliste.cible + "/" + file), new Path(Parametre_vigiliste.cible + "/" + "FIC_PER_DATALAKE_.csv"));
grc_sigma.unpersist()
Not written in an IDE so it might be off somewhere, but you get the general idea.
import org.apache.spark.sql.SaveMode

val frames = Seq("table1", "table2").map { table =>
  spark.read.table(table).cache()
}
frames
  .reduce(_.union(_)) // or unionByName() if the columns aren't in the same order
  .repartition(1)
  .write
  .mode(SaveMode.Overwrite)
  .format("csv")
  .options(Map("header" -> "true", "delimiter" -> "|"))
  .save("filePathName")

Working with Dates in Google Apps Script

What I am trying to do here is this - I want to give index to only the workdays in each week.
So, if in a week, Monday and Wednesday are holidays, then Tuesday should get 1, Thursday should get 2, Friday should get the index 3. Otherwise, in a normal week without any holidays, Monday should get 1, Tuesday 2, Wednesday 3, and so on ...
Here is the code I have written (I haven't coded in years now, so please pardon the crude approach)
Sheet 'Holidays' contains a list of holidays in the column B starting from row 2
Variable date is the date for which I want to find out the index for
Variable dayOfTheWeek is the number of day of 'date' counted from last Sunday, so if date is a Monday, dayOfTheWeek is 1; if date is Tuesday, dayOfTheWeek is 2, and so on ...
function indexOfWorkdayOfTheWeek (date, dayOfTheWeek, lastSundayDate)
{
var activeSheet = SpreadsheetApp.getActiveSpreadsheet();
var activeCell = activeSheet.getActiveRange();
var activeRow = activeCell.getRowIndex();
var activeColumn = activeCell.getColumn();
var count = 1;
for (var j = 1; j < dayOfTheWeek; j++)
{
var date2 = lastSundayDate.valueOf() + j*86400;
Logger.log('Date ' + j + ' is:' + date2);
Logger.log('Last Sunday is:' + lastSundayDate);
if (holidayOrNot(date2) == true)
{
}
else
{
count = count + 1;
}
}
return count;
}
function holidayOrNot(date2)
{
var holidaysSheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Holidays');
var listOfHolidays = holidaysSheet.getSheetValues(2, 2, 95, 1);
var isDateMatch = false;
for (var k = 0; k < 90; k++)
{
if (date2 == listOfHolidays[k].valueOf())
{
isDateMatch = true;
break;
}
else
{
continue;
}
}
return isDateMatch;
}
I think the problem is two-fold here:
The date2 calculation isn't working for some reason (var date2 = lastSundayDate.valueOf() + j*86400;)
The function holidayOrNot is returning false, no matter what, even if it encounters a holiday ... the condition date2 == listOfHolidays[k] isn't working for some reason...
Help would be appreciated!
Maybe the method below could help with your calculations. It returns an integer corresponding to the day of the year, so if you apply it to your holidays and compare the result with the days of interest, it could be a good way to find matches.
Here it is. Just add these lines outside of any function in your script (so you can use it anywhere), then use it like this:
var d = new Date().getDOY();
Logger.log(d)
Here is the method:
Date.prototype.getDOY = function() {
  var onejan = new Date(this.getFullYear(), 0, 1);
  return Math.ceil((this - onejan) / 86400000);
}
Assuming that lastSundayDate is being passed around correctly, I see a glaring problem:
lastSundayDate.valueOf().
valueOf() on a Date object returns its primitive value, the number of milliseconds since the epoch, something like 1384628769399. It looks like you are trying to add j days to the date, but 86400 is the number of seconds in a day, so j * 86400 moves the value forward by only about 86 seconds per step, and date2 ends up as a plain number rather than a Date. I can't tell what the logic is supposed to be here.
What you really want to accomplish is something like Date.getDay(), or something similar so that you can add hours, days, etc. to the original Date. This is likely the source of all your problems.
What you can do is read the Mozilla Developer Network documentation on Date objects to see all of the functions on Dates and their uses. You can greatly simplify what you're trying to do by using these functions, instead of doing abstract operations like j * 86400.
It should also be noted that you can do simple operations such as the following, to add 4 hours to the current Date (time):
var myDate = new Date();
Logger.log(myDate); // ~ console.write
var laterDate = new Date(myDate.setHours(myDate.getHours() + 4));
Logger.log(laterDate); // ~ console.write
which gives the following:
[13-11-16 14:13:38:947 EST] Sat Nov 16 14:13:38 GMT-05:00 2013
[13-11-16 14:13:38:954 EST] Sat Nov 16 18:13:38 GMT-05:00 2013
Working with dates can be tricky - but it's always best to use the simplest methods that are available, which are built into the Date objects themselves. There are also numerous other libraries that provide extended functionality for Dates such as Date js.
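As a rough sketch of that advice in plain JavaScript (not tested inside Apps Script itself): the helper names addDays, sameDay and workdayIndex are invented for illustration, and the holidays are assumed to arrive as an array of Date objects, which in the sheet would have to be built from the 'Holidays' range in the question.

```javascript
// Return a new Date that is `n` calendar days after `date`.
// setDate() rolls over month and year boundaries on its own.
function addDays(date, n) {
  var result = new Date(date.getTime());
  result.setDate(result.getDate() + n);
  return result;
}

// Two distinct Date objects are never == to each other,
// so compare their year, month and day instead.
function sameDay(a, b) {
  return a.getFullYear() === b.getFullYear() &&
         a.getMonth() === b.getMonth() &&
         a.getDate() === b.getDate();
}

// Count the non-holiday days from Monday up to and including the day
// that is `dayOfTheWeek` days after `lastSunday` (Monday = 1).
function workdayIndex(lastSunday, dayOfTheWeek, holidays) {
  var count = 0;
  for (var j = 1; j <= dayOfTheWeek; j++) {
    var day = addDays(lastSunday, j);
    var isHoliday = holidays.some(function (h) { return sameDay(h, day); });
    if (!isHoliday) {
      count++;
    }
  }
  return count;
}

// Example week: Sunday 10 Nov 2013, with Monday and Wednesday as holidays.
var lastSunday = new Date(2013, 10, 10);
var holidays = [new Date(2013, 10, 11), new Date(2013, 10, 13)];
console.log(workdayIndex(lastSunday, 2, holidays)); // Tuesday -> 1
console.log(workdayIndex(lastSunday, 4, holidays)); // Thursday -> 2
console.log(workdayIndex(lastSunday, 5, holidays)); // Friday -> 3
```

Comparing year, month and day rather than raw millisecond values also keeps the match working when a holiday cell carries a time of day.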
If you're still running into your problem after attempting to try using methods I displayed above, please run your script and post both the Execution Transcript and the content of the Logger so that I can help you narrow down the issue :)