Azure Data Factory: Data Lifecycle Management and cleaning up stale data

I'm working on a requirement to reduce the cost of data storage. It includes the following tasks:
Being able to remove files from File Share and blobs from Blob Storage, based on their last modified date.
Being able to change the tier of individual blobs, based on their last modified date.
Does Azure Data Factory have built-in activities to take care of these tasks? What's the best approach for automating the clean-up process?

1. Being able to remove files from File Share and blobs from Blob Storage, based on their last modified date.
This requirement can be covered by ADF's built-in Delete Activity.
Create a Blob Storage dataset, configure the range of last modified dates, and refer to this example: https://learn.microsoft.com/en-us/azure/data-factory/delete-activity#clean-up-the-expired-files-that-were-last-modified-before-201811
Please also consider a backup strategy to guard against accidents, because deleted files may not be recoverable.
2. Being able to change the tier of individual blobs, based on their last modified date.
There is no built-in ADF feature for this. However, since your profile shows you are a .NET developer, take a look at this case: Azure Java SDK - set block blob to cool storage tier on upload, which shows that the tier can be changed in SDK code. It's easy to create an Azure Function for such a simple task, and ADF supports the Azure Function Activity, so the function can be called from your pipeline.
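If you end up writing that function in Python rather than .NET, a minimal sketch using the azure-storage-blob SDK might look like this; the connection string, container name, and 30-day threshold are placeholders, not values from the question:

```python
from datetime import datetime, timedelta, timezone

from azure.storage.blob import BlobServiceClient

# Placeholders - substitute your own values.
CONNECTION_STRING = "<storage-account-connection-string>"
CONTAINER_NAME = "<container-name>"
COOL_AFTER_DAYS = 30  # hypothetical threshold

service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container = service.get_container_client(CONTAINER_NAME)
cutoff = datetime.now(timezone.utc) - timedelta(days=COOL_AFTER_DAYS)

for blob in container.list_blobs():
    # Blobs not modified within the threshold are moved to the Cool tier.
    if blob.last_modified < cutoff:
        container.get_blob_client(blob.name).set_standard_blob_tier("Cool")
```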

Related

GCP Dataflow vs Cloud Functions to automate scraping output and file-on-cloud merge into JSON format to insert in DB

I have two sources:
A CSV that will be uploaded to a cloud storage service, probably GCP Cloud Storage.
The output of a scraping process done with Python.
When a user updates 1) (the cloud-stored file), an event should be triggered to execute 2) (the scraping process), and then some transformation should take place in order to merge these two sources into one in JSON format. Finally, the content of this JSON file should be stored in a DB that is easy to access and low cost. The files the user will update are at most 5 MB and the updates will take place once weekly.
From what I've read, I can use GCP Cloud Functions to accomplish this whole process or I can use Dataflow too. I've even considered using both. I've also thought of using MongoDB to store the JSON objects of the two sources final merge.
Why should I use Cloud Functions, Dataflow or both? What are your thoughts on the DB? I'm open to different approaches. Thanks.
Regarding the use of Cloud Functions and Dataflow: in your case I would go for Cloud Functions, as you don't have a big volume of data. Dataflow is more complex, more expensive, and you would have to use Apache Beam. If you are comfortable with Python, and taking your scenario into consideration, I would choose Cloud Functions. Easy, convenient...
To trigger a Cloud Function when a Cloud Storage object is updated, you will have to configure the trigger. Pretty easy.
https://cloud.google.com/functions/docs/calling/storage
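For illustration, a stripped-down background Cloud Function along these lines could look like the sketch below; run_scraper and the "MergedDocument" kind are hypothetical placeholders for your own scraping and data model, and the Datastore write anticipates the DB suggestion that follows:

```python
import csv
import io
import json

from google.cloud import datastore, storage


def run_scraper():
    """Hypothetical placeholder for your existing Python scraping logic."""
    return [{"url": "https://example.com", "title": "scraped item"}]


def merge_csv_and_scrape(event, context):
    """Background Cloud Function fired by a google.storage.object.finalize
    trigger whenever the user uploads/updates the CSV in the bucket."""
    bucket_name = event["bucket"]
    file_name = event["name"]

    # 1) Read the freshly uploaded CSV from Cloud Storage.
    blob = storage.Client().bucket(bucket_name).blob(file_name)
    rows = list(csv.DictReader(io.StringIO(blob.download_as_text())))

    # 2) Run the scraping step and merge both sources into one JSON document.
    merged = {"csv_rows": rows, "scraped": run_scraper()}

    # 3) Store the merged JSON in Datastore; "MergedDocument" is an arbitrary
    #    kind name, and the payload is kept as an unindexed JSON string.
    ds = datastore.Client()
    entity = datastore.Entity(
        key=ds.key("MergedDocument", file_name),
        exclude_from_indexes=("payload",),
    )
    entity.update({"payload": json.dumps(merged)})
    ds.put(entity)
```

You would deploy this with a google.storage.object.finalize trigger on the bucket, as described in the page linked above.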
Regarding the DB: MongoDB is a good option, but if you want something quick and inexpensive, consider Datastore.
As a managed service it will make your life easier, with a lot of native integrations. It also has a very interesting free tier.

How to copy Azure storage account contents (tables, queues, blobs) to other storage account

I'm using Azure Durable Functions and I want to make a copy of the "production data" storage account to another storage account, to test out some things (such as data migration strategies, performance and other issues).
For that, I was hoping to find the easiest and most out of the box way, to just copy the full contents of an Azure Storage Account to another Storage Account.
Is there something out there? Or maybe a PowerShell script that can perform this?
(replication only works for blob objects)
Best regards
Azure Storage accounts are made up of 4 different types of storage: blobs, fileshares, queues and tables. Different methods and/or strategies will be required to copy each type to a new storage account.
There are many tools you could use to automate the process including the Az PowerShell module, the az cli and the Azure REST API. Not all tools provide methods for working with all 4 types of storage.
First you will need to create a new storage account to copy your data to. Be aware that account names are globally unique, not just unique within your tenant, are limited to 24 characters in length, and can only be made up of lowercase letters and numbers. You can use the New-AzStorageAccount PowerShell cmdlet to create the new account.
Blobs
Blob storage is made up of containers that hold blobs in a flat file structure. You will need to recreate each container from your source account; this can be done using the PowerShell cmdlet New-AzStorageContainer. Once the containers have been created you can use Start-AzStorageBlobCopy to copy the blobs from the container in your source account to the container in your new account.
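If you would rather script this part in Python than PowerShell, a rough equivalent using the azure-storage-blob package might look like this; the connection strings are placeholders:

```python
from azure.storage.blob import BlobServiceClient

# Placeholders - connection strings for the two storage accounts.
src = BlobServiceClient.from_connection_string("<source-connection-string>")
dst = BlobServiceClient.from_connection_string("<destination-connection-string>")

for container in src.list_containers():
    # Recreate each container (assumes it doesn't already exist in the
    # destination account).
    dst_container = dst.create_container(container.name)

    for blob in src.get_container_client(container.name).list_blobs():
        source_url = src.get_blob_client(container.name, blob.name).url
        # Server-side copy; if the source container is private, append a SAS
        # token with read permission to source_url.
        dst_container.get_blob_client(blob.name).start_copy_from_url(source_url)
```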
Fileshares
The storage account can contain multiple fileshares, each containing a nested folder structure of varying depth. You can create a new fileshare in your destination account using New-AzStorageShare. Once you've recreated your fileshares you'll need to recurse through the folder structure in each share to get all the files. You can then use New-AzStorageDirectory to recreate the folder structure and Start-AzStorageFileCopy to copy the files into the new fileshare.
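Purely as an illustrative alternative to the cmdlets above, a recursive copy with the azure-storage-file-share Python package might look roughly like this; connection strings are placeholders:

```python
from azure.storage.fileshare import ShareServiceClient

# Placeholders - connection strings for the two storage accounts.
src = ShareServiceClient.from_connection_string("<source-connection-string>")
dst = ShareServiceClient.from_connection_string("<destination-connection-string>")


def copy_directory(src_dir, dst_dir):
    """Recurse through a share, recreating folders and copying files."""
    for item in src_dir.list_directories_and_files():
        if item["is_directory"]:
            new_dst = dst_dir.create_subdirectory(item["name"])
            copy_directory(src_dir.get_subdirectory_client(item["name"]), new_dst)
        else:
            source_url = src_dir.get_file_client(item["name"]).url
            # Server-side copy; a SAS token may need to be appended to
            # source_url if the source account is not otherwise readable.
            dst_dir.get_file_client(item["name"]).start_copy_from_url(source_url)


for share in src.list_shares():
    dst.create_share(share.name)
    copy_directory(
        src.get_share_client(share.name).get_directory_client(),
        dst.get_share_client(share.name).get_directory_client(),
    )
```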
Queues
As queues use a publisher-subscriber model and the data is transient, it may be easiest to recreate the queue and use a variation on the method your current publisher uses to populate the new queue with data. You can use New-AzStorageQueue to create the new queue. Alternatively, you could create the new queue, repoint the publisher to it, and only repoint the subscribers once the old queue is drained. For your use case the first approach may be more suitable.
Tables
The storage account can contain multiple tables, each containing multiple rows of data. You can use New-AzStorageTable to recreate the tables, but this will not copy over the data they contain. Unfortunately there isn't a cmdlet in the Az module to do this, but the AzTable module contains the Get-AzTableRow and Add-AzTableRow cmdlets, which should allow you to copy the rows to the new table.
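If PowerShell isn't a hard requirement, another option for the row data is a small Python script using the azure-data-tables package; a rough sketch, with placeholder connection strings:

```python
from azure.data.tables import TableServiceClient

# Placeholders - connection strings for the two storage accounts.
src = TableServiceClient.from_connection_string("<source-connection-string>")
dst = TableServiceClient.from_connection_string("<destination-connection-string>")

for table in src.list_tables():
    # Recreate the table in the destination account, then copy every entity.
    dst.create_table_if_not_exists(table.name)
    dst_table = dst.get_table_client(table.name)
    for entity in src.get_table_client(table.name).list_entities():
        dst_table.create_entity(entity)
```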
Summary
In practice, implementing all this will require quite a lengthy script, which will only grow if you need the process to handle large volumes of data and to handle errors to ensure an accurate copy. I've developed a script that handles blobs and fileshares, which was sufficiently robust and quick for the hobby project I needed it for. However, it took several hours to copy around 10 accounts, the largest of which contained less than 1 GB of data, so it probably won't scale well to a commercial environment. The script can be found here if you wish to use it as a starting point.
Potentially you can try the azcopy functionality; have a look here:
azcopy copy 'https://mysourceaccount.blob.core.windows.net/mycontainer?sv=2018-03-28&ss=bfqt&srt=sco&sp=rwdlacup&se=2019-07-04T05:30:08Z&st=2019-07-03T21:30:08Z&spr=https&sig=CAfhgnc9gdGktvB=ska7bAiqIddM845yiyFwdMH481QA8%3D' 'https://mydestinationaccount.blob.core.windows.net/mycontainer' --recursive
The example above copies the whole container from one account to another.

Do I need a storage (of some sort) when pulling data in Azure Data factory

*Data newbie here*
Currently, to run analytics report on data pulled from Dynamics 365, I use Power BI.
The issue with this is that Power BI is quite slow at processing large amounts of data. I carry out a number of transform steps (e.g. Merge, Join, deleting or renaming columns, etc.). So, when I try to run a query in Power BI with said steps, it takes a long time to complete.
So, as a solution, I decided to make use of Azure Data Factory(ADF). The plan is to use ADF to pull the data from CRM (i.e. Dynamics 365), perform transformations and publish the data. Then I'll use Power BI for visual analytics.
My question is:
What Azure service will I need in addition to Data Factory? Will I need to store the data I pulled from CRM somewhere, like Azure Data Lake or Blob storage? Or can I do the transformation on the fly, right after the data is ingested?
Initially, I thought I could use the 'copy' activity to ingest data from CRM and start playing with the data. But with the copy activity, I need to provide a sink (a destination for the data, which has to be storage of some sort).
I also thought I could make use of the 'lookup' activity. I tried to use it, but I'm getting errors (no exception message is produced).
I have scoured the internet for a similar process (i.e. Dynamics 365 -> Data Factory -> Power BI), but I've not been able to find any.
Most of the processes I've seen, however, utilise some sort of data storage right after data ingestion.
All response welcome. Even if you believe I am going about this the wrong way.
Thanks.
A few things here:
The copy activity just moves data from a source to a sink. It doesn't modify the data on the fly.
The lookup activity is just for looking up some attributes to use later in the same pipeline.
ADF cannot publish a dataset to Power BI (although it may be able to push to a streaming dataset).
Your approach is correct, but you need that last step of transforming the data. You have a lot of options here, but since you are already familiar with Power BI you can use Wrangling Dataflows, which allow you to take a file from the data lake, apply some Power Query transformations, and save a new file in the lake. You can also use Mapping Dataflows, Databricks, or any other data transformation tool.
Lastly, you can pull the files from the data lake with Power BI to build your report on the data in this new file.
Of course, as always in Azure, there are a lot of ways to solve a problem or architect a service; this is the one I consider simplest for you.
Hope this helped!

How to get max of a given column from ADF Copy Data activity

I have a copy data activity for on-premise SQL Server as source and ADLS Gen2 as sink. There is a control table to pickup tableName, watermarkDateColumn and the watermarkDatetime to pull incremental data from the source database.
After the data is pulled/loaded into the sink, I want to get the max of the watermarkDateColumn in my dataset. Can it be obtained from @activity('copyActivity1').output?
I'm not allowed to use an extra Lookup activity to query the source table for max(watermarkDateColumn) in the pipeline.
The Copy activity can only be used for data movement, not for any aggregation, so @activity('copyActivity1').output won't help. Since you said you can't use a Lookup activity, I'm afraid your requirement can't be met as things stand.
If you prefer not to use additional activities, I suggest using a Data Flow Activity instead, which is more flexible. There is a built-in aggregation feature in the Data Flow Activity.

Data streaming to Google Cloud ML Engine

I found that Google ML Engine expects data in Cloud Storage, BigQuery, etc. Is there any way to stream data to ML Engine? For example, imagine that I need to use data in a WordPress or Drupal site to create a TensorFlow model, say a spam detector. One way is to export the whole dataset as CSV and upload it to Cloud Storage using the google-cloud-php library. The problem here is that, for every minor change, we have to upload the whole dataset. Is there any better way?
By "minor change", do you mean that when you get new data, you have to upload everything (the old and the new data) to GCS again? One idea is to export just the new data to GCS on some schedule, creating many CSV files over time. You can write your trainer to take a file pattern and expand it using get_matching_files/Glob, or to take multiple file paths.
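For example, with a current TensorFlow version the trainer can expand a Cloud Storage pattern at run time; a minimal sketch, assuming a hypothetical gs://my-bucket/exports path:

```python
import tensorflow as tf

# Placeholder GCS path; each scheduled export adds another CSV shard here.
filenames = tf.io.gfile.glob("gs://my-bucket/exports/*.csv")

# Feed every matching shard into one input pipeline.
dataset = tf.data.TextLineDataset(filenames)
```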
You can also modify your training code to start from an old checkpoint and train over just the new data (which is in its own file) for a few steps.
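A minimal sketch of that warm-start idea, assuming a Keras model and a hypothetical checkpoint location on GCS (the stand-in model and paths are placeholders for your own):

```python
import tensorflow as tf

# Hypothetical paths; adjust to your bucket layout.
CHECKPOINT_DIR = "gs://my-bucket/checkpoints"

# A stand-in model; in practice this is your existing spam-detector model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Restore the weights saved by the previous training run, if any.
latest = tf.train.latest_checkpoint(CHECKPOINT_DIR)
if latest:
    model.load_weights(latest)

# Then train for a few steps on only the newly exported shards and save again:
# model.fit(new_dataset, epochs=1)
# model.save_weights(f"{CHECKPOINT_DIR}/ckpt-latest")
```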