How to mark a checkpoint when reading a stream? - pyspark

I have a daily job that reads and writes using PySpark Structured Streaming. It aggregates the source table up to the previous day and writes the result to a different table.
from datetime import date, timedelta
from pyspark.sql.functions import avg, col, window

yesterday = date.today() - timedelta(days=1)

# Build the daily event window, keep only rows from before today, and aggregate
df = spark.readStream.format("delta").load(path)
df = (df.withColumn("EventDay", window(col("Time").cast("timestamp"), "1 day").start)
        .withWatermark("EventDay", "0 seconds")
        .filter(col("EventDay") < yesterday)
        .agg(avg(col("result"))))

(df.writeStream.format("delta")
    .trigger(once=True)
    .outputMode("append")
    .option("checkpointLocation", check_point_path)
    .start(output_folder))
When I run this job, the source table already contains some of today's rows, but I exclude them in the query's filter with (col('EventDay') < yesterday).
For example, when I run the job on Feb 17, I see results up to Feb 16. This is expected because of the filter in the query.
The issue is that when I run the job the next day, Feb 18, it still does not produce the result for Feb 17.
My question is: how can I mark the checkpoint at the last row of yesterday's results (instead of the last row currently in the table), so that the next run picks up from the beginning of the new day and does the aggregation?

Related

Pyspark: Return next week's Saturday

I'm trying to return next week's Saturday date from the date-type column rel_d.
Normally, in Python, I'd compute the number of days until next week's Saturday and add it to rel_d:
from dateutil.relativedelta import relativedelta

def next_saturday(dt):
    next_sat_dt = dt + relativedelta(days=(12 - dt.weekday()))  # weekday() is 0-based (Monday = 0)
    return next_sat_dt
Creating a UDF in PySpark for this seems like a heavyweight operation. Is there some built-in Spark function that could do it faster?
You could use two next_day calls in PySpark to reach next week's Saturday.
Note that in PySpark the week runs from Sunday to Saturday.
So if you jump to the next Sunday and then from there to the next Saturday, you land exactly on next week's Saturday.
You can then add multiples of 7 with F.date_add to reach the nth week of your choice.
df = df.withColumn('next_saturday_date',F.next_day(F.next_day(F.col('rel_d'), 'Sun'), 'Sat'))
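As a quick sanity check, here is a minimal sketch of the double jump with made-up sample dates (the Spark session and data below are purely illustrative):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Sample dates only: 2023-08-01 is a Tuesday, 2023-08-05 is a Saturday
df = spark.createDataFrame([("2023-08-01",), ("2023-08-05",)], ["rel_d"])
df = df.withColumn("rel_d", F.to_date("rel_d"))

# First jump lands on the next Sunday, the second on that week's Saturday
df = df.withColumn("next_saturday_date",
                   F.next_day(F.next_day(F.col("rel_d"), "Sun"), "Sat"))

df.show()
# Both rows resolve to 2023-08-12, the Saturday of the following week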

Tableau not displaying all dates till current date

I have a Tableau worksheet that shows a count of records by date. The current system date is Mar 14, but my dataset only has data up to Mar 9. Is it possible to show the dates from Mar 10 - Mar 14 even though there is no data for this time frame?
Attached is a snapshot of my worksheet; kindly let me know how I could include all dates in the row even though there is no data for this time period.
Select a discrete date for the row; it will then show the dates on the sheet.

Parsing an RRULE returns wrong occurrences for this specific case

The last instance of a daily or monthly recurrence is not fetched properly, while the same case works fine for a weekly recurrence.
I have saved the recurrence pattern of events in the DB. When I fetch it from the DB, set the event properties, and call event.GetOccurrences(), I have events for 1, 2, 3, 4 Aug, but once the rule is parsed (i.e. after calling event.GetOccurrences()) it gives 4 instances with 1 Aug repeated twice, so I get occurrences for 1, 1, 2, 3 Aug and 4 Aug is missed.
For a daily recurring meeting:
FREQ=DAILY;COUNT=4;BYHOUR=13;BYMINUTE=30;BYSECOND=0
The above pattern gives the instances 1 Aug, 1 Aug, 2 Aug, 3 Aug: four instances, but 4 Aug is missed and 1 Aug is repeated twice. After this I set the time on these instances.
For a weekly recurring meeting:
FREQ=WEEKLY;COUNT=4
When the above pattern is parsed by the same function, it gives 4 instances with the correct dates; after that I set the time for these occurrences from the DB.
I have found a workaround: I reset ByHour, ByMinute and BySecond before adding the rule to the event.
RecurrencePattern recurringPattern = new RecurrencePattern(recurringRule);

// Clear the BYHOUR/BYMINUTE/BYSECOND parts so the time of day
// comes from the event start rather than from the rule
if (recurringPattern.Count > 0)
{
    recurringPattern.ByHour = new List<int>();
    recurringPattern.ByMinute = new List<int>();
    recurringPattern.BySecond = new List<int>();
}

calendarEvent.RecurrenceRules.Add(recurringPattern);
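For comparison, here is a minimal sketch of the same idea in Python with python-dateutil (not Ical.Net, and with a made-up start date): once the BYHOUR/BYMINUTE/BYSECOND parts are dropped and the start time carries the time of day, a daily COUNT=4 rule yields four distinct days with no duplicates.
from datetime import datetime
from dateutil.rrule import rrule, DAILY

# Hypothetical DTSTART carrying the 13:30 meeting time directly
start = datetime(2022, 8, 1, 13, 30, 0)

# Equivalent of FREQ=DAILY;COUNT=4 with no BYHOUR/BYMINUTE/BYSECOND parts
occurrences = list(rrule(DAILY, count=4, dtstart=start))

for occurrence in occurrences:
    print(occurrence)  # 1, 2, 3, 4 Aug at 13:30 -- one per day, no duplicates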

Starting Where Query Left Off in PostgreSQL

I have this code in Groovy, and I am using PostgreSQL as the database server.
If I execute this query, it will get logs from the starting point until the end. However, logs are continuously being added every week; is there a way to change the code so that, when it runs again, it starts where it left off?
def logs = new Logs()
String query = "SELECT * from auditLog where created < (CURRENT_DATE)::date order by id ASC"
PreparedStatement statement = conn.prepareStatement(query)
ResultSet result = statement.executeQuery()
For example:
1st run: Logs from December 1 - December 20 (not fixed)
2nd run: Logs from December 21 - December 31 (not fixed)
3rd run: Logs from January 1 - January 15 (not fixed)
and so on....
The important thing is to be able to start where I left off, which in this case is December 21.
It should be dynamic.
select *
from auditLog
where created > '2015-12-21'
  and created < current_date
Since current_date is already a date, you don't have to cast it. Note that created < current_date will get you rows until the start of today.
Simply mark the log entries as read (add some status flag) and update each time you read
UPDATE auditLog set read = true where id in (<selected ids>)
NOTE: when updating, use ids rather than "UPDATE ... WHERE date > ..." to avoid a case where entries have been added while you were reading.
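A minimal sketch of that mark-as-read flow, written in Python with psycopg2 rather than Groovy and with made-up connection details, just to show the shape of it:
import psycopg2

# Hypothetical connection string, for illustration only
conn = psycopg2.connect("dbname=audit user=app")

with conn, conn.cursor() as cur:
    # Fetch only entries that have not been processed yet
    cur.execute(
        "SELECT id, created FROM auditLog "
        "WHERE read = false AND created < current_date "
        "ORDER BY id ASC"
    )
    rows = cur.fetchall()

    # ... process rows here ...

    # Mark exactly the rows we just read, by id, so entries added
    # in the meantime are still picked up on the next run
    ids = [row[0] for row in rows]
    if ids:
        cur.execute("UPDATE auditLog SET read = true WHERE id = ANY(%s)", (ids,))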

Parse variable in the Query SSIS

In the SQL Task Editor I have the following query:
DELETE FROM [TICKETS_DATA]
WHERE BILLING_TICKETS_DATA_Time_ID <
    (SELECT TIME_ID
     FROM [TIME]
     WHERE (TIME_Year = ?) AND (TIME_Month = ?)) - 1
The [TIME] table has a TIME_ID for each relevant month and year combination.
I have 2 variables, Time_Month (int32) and Time_Year (int32), e.g. 08 and 2012 respectively.
I want to pick up the current TIME_ID and pass it into the above query in the SQL Task Editor.
Currently the table stores 1 month of data, and I now want to store 3 months of data.
Kindly assist me with the parameter mapping and with how to pass the variables into the SQL command query.
As long as TIME_ID is a numeric value that increases by one for each record, with one record per year/month combination in date order (i.e. 2000-01 has TIME_ID 1, 2000-02 has TIME_ID 2, and 2001-01 has TIME_ID 13), then you can just change the -1 to -3 to delete records from your table that are older than three months. For example, if the current month's TIME_ID is 150, the DELETE removes every row with BILLING_TICKETS_DATA_Time_ID below 147. Bear in mind that since this was probably run last month, you will have two months in the table on the first run after this change and it will delete 0 records. On the next run you will have 3 months and it will again delete 0 records. On the third run (assuming it is only run once a month) you will have three months of data and it will delete records from 4 months prior to that date.