Fetch data from two tables using Spark SQL based on run date - Scala

My application (Scala/Spark 2.4) reads the last 6 months of data from a table in HDFS (Table 1). Starting next month, this data will be available in a different table (Table 2). Since my application uses the last 6 months of data, the source has to be both tables for the first few months; once the new table (Table 2) has 6 months' worth of data, it should be the only source.
If the new table (Table 2) has 1 month of data, the program should take 1 month of data from Table 2 and the 5 months prior to that from Table 1.
Similarly, when the new table (Table 2) has 2 months of data, the program should take 2 months of data from Table 2 and the prior 4 months from Table 1.
Once the new table (Table 2) has 6 months' worth of data, the program shouldn't take any data from the old table (Table 1).
How can this be achieved?
I am a beginner in Spark and haven't been able to come up with a solution yet.
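The rule described above comes down to date arithmetic: fix the 6-month window, find the month where Table 2 takes over, and cut the window at that point. Here is a minimal sketch of just that arithmetic, written in plain Python so it runs without a cluster. The names split_window and table2_start are made up for illustration, and it assumes the window is the 6 complete calendar months before the run date's month, which the question does not state.

```python
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date to the first day of the month `months` away (may be negative)."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def split_window(run_date: date, table2_start: date, months_back: int = 6):
    """Return (table1_range, table2_range) as half-open [from, to) date pairs.

    A range is None when that table contributes nothing.
    ASSUMPTION: `table2_start` is the first month for which Table 2 has data,
    and the window covers the `months_back` complete months before the run month.
    """
    window_end = date(run_date.year, run_date.month, 1)  # exclusive upper bound
    window_start = add_months(window_end, -months_back)
    # Clamp the cutover month into the window so both ranges stay valid.
    cutover = min(max(table2_start.replace(day=1), window_start), window_end)
    table1 = (window_start, cutover) if window_start < cutover else None
    table2 = (cutover, window_end) if cutover < window_end else None
    return table1, table2


if __name__ == "__main__":
    # Table 2 (hypothetically) starts in January 2021; run in February 2021.
    print(split_window(date(2021, 2, 15), date(2021, 1, 1)))
```

In Spark, each non-empty range would presumably become a filter on that table's date column (whatever it is called in your schema), and the two filtered DataFrames can then be combined with union.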

Related

ReplaceWhere greater than year partition in delta table

My data is stored in a Delta table partitioned by year.
This is the query I want to write. df_data_model only has data for years 2020 and above. After executing this query, I want only the years from 2020 onward to be present in the Delta table and the rest deleted. Can this be achieved with a query like this? If yes, what should I write in <?>? If not, what additional script do I need to write to automate this?
The gist of the question is "Delete data that does not exist in the DF, replace data that exists, and create new folders for new data".
(df_data_model
.write
.partitionBy("Year")
.mode('overwrite')
.option("replaceWhere", "<?>")
.format('delta')
.save(path_delta)
)
The replaceWhere works, and the table will no longer contain data for the old years. However, the folders remain there for 7 days, after which they are deleted automatically.
If you load from the Delta table and check the years, it will only have 2020 and 2021, but the folders for 2017-2019 will remain for 7 days.

Get the data from multiple datasources irrespective of data available in primary datasource

I have one database with 2 months of data (e.g. Jan and Feb), and another database with 12 months of data (Jan to Dec).
My task is to show data from both databases, but only for the months present in the primary database. That is, my report should have data only for Jan and Feb, since only those 2 months exist in the primary database.
What I have tried:
Option 1: I tried to blend the data, but for some IDs there is no data for Jan in the primary database, so the data from the secondary database does not show up for them either.
Option 2: I tried to link the databases with a left join on ID and used a Fixed LoD to pick up the full data. Now I get data, but if I use the date field from the secondary database I get 0 (though my data is 1000), and if I use the date field from the primary database I get data for the full year, since I used a Fixed LoD.
How can I get data from the secondary database for only those months (irrespective of whether the primary database has data for those months)?

How to group data by 7 days in db2 select query without using week keyword

I am selecting data from a table with different timestamps, and I need to group the data into 7-day intervals without using the week keyword.
You can use the DAYS function and divide its result by 7, like this:
GROUP BY DAYS(my_dt_column) / 7
See https://www.ibm.com/support/knowledgecenter/en/SS6NHC/com.ibm.swg.im.dashdb.sql.ref.doc/doc/r0000789.html
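To see what the integer division does, here is a small Python sketch of the same bucketing outside the database. DAYS returns a day number counted from January 1, 0001 (that date being day 1), and as far as I know Python's date.toordinal() uses the same convention, so it stands in for DAYS here. The function name group_by_7_days is just for illustration.

```python
from collections import defaultdict
from datetime import date


def group_by_7_days(dates):
    """Group dates into 7-day buckets keyed by (day number // 7)."""
    buckets = defaultdict(list)
    for d in dates:
        # toordinal() counts days with 0001-01-01 as day 1;
        # integer division by 7 gives the same key to 7 consecutive days.
        buckets[d.toordinal() // 7].append(d)
    return dict(buckets)


if __name__ == "__main__":
    sample = [date(2024, 1, 1), date(2024, 1, 6), date(2024, 1, 7)]
    for key, members in sorted(group_by_7_days(sample).items()):
        print(key, members)
```

Note that the buckets sit on a fixed 7-day grid defined by the day number, not on the first date that happens to appear in the data. If the intervals must start at a particular date, subtract that date's day number before dividing.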

Managing dates in SPSS - Time difference in months

I'm a novice SPSS user working on a data set with two columns, customer ID and order date. I want to create a third variable holding the number of inactive months since each customer ID's last order date. This is what the data looks like:
This will create some sample data to demonstrate on:
data list list/ID (f3) OrderDate (adate10).
begin data
1 09/18/2016
1 03/02/2017
1 05/12/2017
2 06/06/2016
2 09/09/2017
end data.
Now you can run the following syntax to create a variable that contains the number of complete months between the date in the present row and the date in the previous row:
sort cases by ID OrderDate.
if ID=lag(ID) MonthSince=DATEDIFF(OrderDate, lag(OrderDate), "months").
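For anyone who wants to check the expected numbers outside SPSS, here is a rough Python sketch of the same idea: sort by ID and date, then count complete months between each order and the previous order of the same ID. The function names are made up for illustration, and the day-of-month comparison is a simple way of counting complete months that may not match SPSS exactly around month ends.

```python
from datetime import date


def complete_months(earlier: date, later: date) -> int:
    """Number of complete months from `earlier` to `later`."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the last month is not complete yet
    return months


def months_since_previous(orders):
    """orders: iterable of (customer_id, order_date) pairs.

    Returns (customer_id, order_date, months) tuples sorted by ID and date;
    months is None for the first order of each customer.
    """
    result = []
    prev_id, prev_date = None, None
    for cid, d in sorted(orders):
        gap = complete_months(prev_date, d) if cid == prev_id else None
        result.append((cid, d, gap))
        prev_id, prev_date = cid, d
    return result


if __name__ == "__main__":
    # Same sample data as in the SPSS syntax above.
    sample = [
        (1, date(2016, 9, 18)), (1, date(2017, 3, 2)), (1, date(2017, 5, 12)),
        (2, date(2016, 6, 6)), (2, date(2017, 9, 9)),
    ]
    for row in months_since_previous(sample):
        print(row)
```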

Spotfire Text to Integer for Dates

I am attempting to load time series data from an Excel spreadsheet into Spotfire. In my spreadsheet there is a separate column for year (Spotfire sees it as an integer) and for month (Spotfire sees it as text, since it is in the three-letter abbreviation format, i.e. January is JAN). I am trying to avoid changing the data in Excel and would like to do all of my work in Spotfire, as the data will be updated periodically. How do I link these columns in Spotfire so that I can plot a variable over a time frame?
Click Insert > Insert Calculated Column... Make sure you have the right data table selected. In the Expression field type:
Date([year],
case when [month]="JAN" then 1
when [month]="FEB" then 2
when [month]="MAR" then 3
when [month]="APR" then 4
when [month]="MAY" then 5
when [month]="JUN" then 6
when [month]="JUL" then 7
when [month]="AUG" then 8
when [month]="SEP" then 9
when [month]="OCT" then 10
when [month]="NOV" then 11
when [month]="DEC" then 12 end,
1)
I would name it something like "monthdate". Note that each date will have the day equal to 1. If you also have the day in your data, just put that column in the formula above instead of the last 1.
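The case expression is essentially a lookup from abbreviation to month number. If you ever need to pre-check the data outside Spotfire, the same mapping can be sketched in Python like this (month_date is a made-up name, and the upper-casing and trimming are my own additions, not something the Spotfire expression does):

```python
from datetime import date

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
# Map each three-letter abbreviation to its month number (JAN -> 1, ..., DEC -> 12).
MONTH_NUMBER = {abbr: number for number, abbr in enumerate(MONTHS, start=1)}


def month_date(year: int, month_abbr: str, day: int = 1) -> date:
    """Build a date from a year and a three-letter month abbreviation.

    The day defaults to 1, mirroring the calculated column above.
    Raises KeyError for an abbreviation that is not in the table.
    """
    return date(year, MONTH_NUMBER[month_abbr.strip().upper()], day)


if __name__ == "__main__":
    print(month_date(2019, "JAN"))
    print(month_date(2020, "DEC", 15))
```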