This is how my data is stored in the 'year' partition of a Delta table.
This is the query I want to write. df_data_model only has data for years 2020 and above. After executing this query I want only years 2020 and above to be present in the Delta table, with the rest deleted. Can this be achieved with a query like this? If yes, what should I write in <?>? If not, what additional script do I need to write if I am to automate this?
The gist of the question is: "delete data that does not exist in the DF, replace data that exists, and create new folders for new data".
(df_data_model
.write
.partitionBy("Year")
.mode('overwrite')
.option("replaceWhere", "<?>")
.format('delta')
.save(path_delta)
)
The replaceWhere works, and there will be no data left in those folders. However, the folders themselves remain for 7 days before they are automatically deleted.
If you load from the Delta table and check the years, it will only have 2020 and 2021, but the folders for 2017-2019 will remain for 7 days.
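A minimal sketch of what could go in <?>, assuming the existing partitions run from Year=2017 to Year=2021 and df_data_model only holds Year >= 2020 (the exact predicate value is an assumption based on the years mentioned above). The replaceWhere predicate must cover every partition you want replaced; partitions it matches that have no rows in the DataFrame end up empty, and their files are physically removed only after the 7-day retention window.
# Sketch only: "Year >= 2017" is an assumed predicate covering all existing partitions.
# The rows being written must satisfy the predicate, which they do here because
# df_data_model only contains Year >= 2020.
(df_data_model
.write
.partitionBy("Year")
.mode('overwrite')
.option("replaceWhere", "Year >= 2017")
.format('delta')
.save(path_delta)
)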
I am loading Delta tables into an S3 delta lake. The table schema is product_code, date, quantity, crt_dt.
I am getting 6 months of forecast data; for example, if this month is May 2022, I will get quantity data for May, June, July, Aug, Sept and Oct. The issue I am facing is that the data gets duplicated every month. I want only a single row in the Delta table based on the most recent crt_dt, as shown in the screenshot below. Can anyone help me with the solution I should implement?
The data is partitioned by crt_dt.
Thanks!
If you want to keep only the most recent crt_dt, this code will normally do the trick:
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# keep the row with the latest crt_dt for each product and forecast date
w3 = Window.partitionBy("product_code", "date").orderBy(col("crt_dt").desc())
df.withColumn("row", row_number().over(w3)) \
    .filter(col("row") == 1).drop("row") \
    .show()
For more details, check https://sparkbyexamples.com/pyspark/pyspark-select-first-row-of-each-group/
You have a dataset that you'd like to filter and then write out to a Delta table.
Another poster told you how to filter the data to meet your requirements. Here's how to filter the data and then write it out.
filtered_df = df.withColumn("row", row_number().over(w3)) \
    .filter(col("row") == 1).drop("row")

filtered_df.write.format("delta").mode("append").save("path/to/delta_lake")
You can also do this with SQL if you aren't using the Python API.
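A sketch of the SQL route driven from Python via spark.sql, assuming a SparkSession named spark and that the DataFrame is registered as a temporary view called forecasts (the view name is an assumption; the column names come from the question's schema):
# Register the data under a hypothetical view name and keep, per product and
# forecast date, only the row with the latest crt_dt.
df.createOrReplaceTempView("forecasts")
deduped = spark.sql("""
    SELECT product_code, date, quantity, crt_dt
    FROM (
        SELECT *,
               row_number() OVER (PARTITION BY product_code, date
                                  ORDER BY crt_dt DESC) AS rn
        FROM forecasts
    ) t
    WHERE rn = 1
""")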
I am very new to Azure Data Factory. I have created a simple pipeline that uses the same source and target table. The pipeline is supposed to take the date column (datatype date, as shown in the schema below) from the source table, apply an expression to it, and load 1 into the column last_7_days if the date is within the last 7 days, or 0 otherwise.
The schema for both the source and target tables looks like this:
Now, I am facing a challenge writing an expression in the DerivedColumn component. I have managed to work out the date that is 7 days ago.
In summary, the idea is to load the last_7_days column in the target table with the value '1' if date >= current date - interval 7 day AND date <= current date, as in SQL. I would be very grateful if anyone could help me with any tips and suggestions. If you require further information, please let me know.
Just for more information: the source/target table's date column is static, with 10 years of dates from 2020 till 2030 in yyyy-mm-dd format. The ETL should run every day and put the value 1 in the last_7_days column only for the last 7 days looking back from the current date. All other entries must receive the value 0.
You can use the expression below:
case(date == currentDate(), 1, date >= subDays(currentDate(), 7), 1, date < subDays(currentDate(), 7), 0, date > currentDate(), 0)
If I were you, I would also choose the case() function to build the expression.
About your question in the comment: I'm afraid there isn't another, more elegant way to do this. Data Flow expressions can be complex and may be composed of many functions; the case() function is the best one for you here.
It's very clear and easy to understand.
I have one database with 2 months of data (e.g. Jan and Feb) and another database with 12 months of data (Jan to Dec).
My task is to show data from both databases, but only for the months present in the primary database; in other words, my report should have data only for Jan and Feb, since only those 2 months exist in the primary database.
What I have tried:
As option 1, I tried to blend the data, but the problem is that for some IDs there is no data for Jan, so in that case the data from the secondary database is also not shown, since there is no data in the primary database.
As option 2, I tried to link the databases with a left join on ID and used a FIXED LOD to pick up the full data. Now I get data, but the problem here is that if I use the date field from the secondary database I get 0 (though my data is 1000), and if I take the date field from the primary database I get data for the full year, since I used a FIXED LOD.
How can I get data for only those months from the secondary database (irrespective of whether the primary database has data for those months)?
We have our data in a relational database, in a single table with id and date columns, like this:
productid date value1 value2
1 2005-10-26 24 27
1 2005-10-27 22 28
2 2005-10-26 12 18
I am trying to load them to S3 as Parquet and create metadata in Hive so they can be queried with Athena and Redshift. Our most frequent queries will filter on product id, day, month and year, so I am trying to lay out the data partitions in a way that gives better query performance.
From what I understood, I can create the partitions like this:
s3://my-bucket/my-dataset/dt=2017-07-01/
...
s3://my-bucket/my-dataset/dt=2017-07-09/
s3://my-bucket/my-dataset/dt=2017-07-10/
or like this,
s3://mybucket/year=2017/month=06/day=01/
s3://mybucket/year=2017/month=06/day=02/
...
s3://mybucket/year=2017/month=08/day=31/
Which will be faster in terms of queries, given that I have 7 years of data?
Also, how can I add partitioning for product id here, so that queries are faster?
How can I create these folder structures (s3://mybucket/year=2017/month=06/day=01/) with key=value naming using Spark Scala? Any examples?
We partitioned like this,
s3://bucket/year/month/day/hour/minute/product/region/availabilityzone/
s3://bucketname/2018/03/01/11/30/nest/e1/e1a
The minute is rounded to 30-minute buckets. If your traffic is high, you can go for a higher resolution on minutes, or you can reduce it to hourly or even daily.
This helped a lot, depending on what data we want to query (using Athena or Redshift Spectrum) and over what time span.
Hope it helps.
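For the folder-creation part of the question, here is a minimal PySpark sketch (the bucket path and source table name are assumptions, and the Scala DataFrame API is analogous): derive year/month/day columns from the date column and let partitionBy create the key=value folders on write.
from pyspark.sql import SparkSession
from pyspark.sql.functions import year, month, dayofmonth, col

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

# Hypothetical source: the relational table from the question loaded as a DataFrame;
# assumes `date` is a date/timestamp column.
df = spark.read.table("source_db.product_values")

# Derive the partition columns, then write; Spark creates year=.../month=.../day=... folders.
# productid could be added to partitionBy as well if its cardinality is manageable.
(df.withColumn("year", year(col("date")))
   .withColumn("month", month(col("date")))
   .withColumn("day", dayofmonth(col("date")))
   .write
   .partitionBy("year", "month", "day")
   .mode("overwrite")
   .parquet("s3://my-bucket/my-dataset/"))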
In my SSDT report I have a field coming from a database called modified date. In the table in my report I want to display only records where the modified date is more than 7 days old; in other words, if a record hasn't been modified in 7 days, show the record.
I've looked at the Filter properties of the table but can't figure out the exact way of achieving what I want.
Does anyone have any suggestions/examples?