How does Azure Purview perform Data Lineage in Azure Data Factory when there are multiple Copy Data Activities on the same Source? - azure-data-factory

My particular scenario is this:
Data Factory Pipeline
I have a .txt file in Azure Blob Storage.
I copy this file from Blob Storage to Azure SQL DB
I copy the same file to an archive location in the same blob container
I then delete the file after it's archived
When I triggered the pipeline in Azure Data Factory, Purview gave me a data lineage that only showed the archive copy activity and never showed the Blob to Azure SQL DB copy activity.
Refer to this screenshot for lineage: Purview Data Lineage
When I navigate to the Azure SQLDB destination in Purview, it says no data lineage is available for this asset.
Here is what I have done or thought could be the reason:
Maybe the copy activities need to be in different pipelines. I tested this and the same result occurs.
Maybe because I deleted the file, it's not picking up the Blob source to Azure SQL DB copy activity. I will be testing this, but I think it's unlikely since it did pick up the Blob source to Blob archive destination copy activity.
Maybe it only picks up the last copy activity for a given source, and not all of them. I tested this and it did not change the data lineage. It's possible that I need to do something in Azure Purview to "reset" the data lineage, but I think it uses the last pipeline run for the source; I noticed the lineage did update when I separated the pipeline into two pipelines (one for loading Azure SQL DB, the other for archiving the blob file).
I'm fairly stuck on this one... I will completely remove the archiving and see what happens, but according to all of the Microsoft Documentation, Data Lineage for Azure Blob and Azure SQLDB is supported, so this should be working. If anyone has answers or ideas, I would love to hear them.
Update: My newest theory is that there is a time lag between when you run a pipeline and when the data lineage is refreshed in Purview... I am going to try disconnecting the Data Factory and reconnecting.
Update #2: Deleting the Data Factory connection and reconnecting did nothing from what I can tell. I have been playing with how to delete assets via the REST API, which is apparently the only way to delete assets/relationships in Purview... I think my next step will be to delete the Purview account and storage.
Update #3: Even after spinning up a new Purview account, the lineage still fails to show the Blob to Azure SQL DB copy. I am wondering if it's because of the ForEach logic I have, but that doesn't make sense because the archive copy activity was inside the ForEach as well. I'm at a loss. I have other copy activities from Blob to Azure SQL DB that work, but not this one.
Thanks

After a LOT of testing, I believe the problem is that Purview does not know how to handle Copy activities that include additional columns.
Does NOT work: With additional columns
Works: Without additional columns
The ONLY difference was that one had additional columns mapped and the other did not. Slight design gap...
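For anyone wondering what I mean by "additional columns": it's the additionalColumns setting on the Copy activity source. A rough sketch of the only difference between the two cases, written as Python dicts standing in for the activity JSON (the source type and column name are illustrative, not my actual pipeline):

```python
# Does NOT get lineage in Purview: a Copy activity source with additionalColumns mapped
source_with_additional_columns = {
    "type": "DelimitedTextSource",
    "additionalColumns": [
        {"name": "SourceFileName", "value": "$$FILEPATH"},  # e.g. stamp the source file path
    ],
}

# DOES get lineage: the same source without additionalColumns
source_without_additional_columns = {
    "type": "DelimitedTextSource",
}
```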
I have created this Azure Purview Feature Request! https://feedback.azure.com/forums/932437-azure-purview/suggestions/42357196-allow-data-lineage-to-be-performed-on-azure-data-f
Please vote for this so it can be implemented in a future release.

Related

How to clear the data in a Power BI Report when publishing it to multiple clients via DevOps

We are building a generic Power BI Reporting suite for a product. Every time we commit the PBIX file to source control, our DevOps project publishes the report to multiple workspaces, clears the SQL connection, and reattaches it to the SQL database for each particular client. We then trigger a dataset refresh so that the report loads the data from its own SQL database.
The problem is that the PBIX file committed to Git contains the imported data from the LAST SQL data source it was connected to (usually our Dev SQL DB). Then, when the report and dataset are published to each client, the client will see this imported DEV data in their report for a period of 20 minutes or so until the dataset has refreshed, which is a security risk and confusing for the user.
I found this workaround using Power Query as an example; however, I am really hoping for a more robust solution.
https://community.powerbi.com/t5/Desktop/Another-way-to-clear-fetched-data-from-PBIX-files/m-p/686627
How can we "clear" the imported data in the PBIX file so that when we publish to each client it is completely empty and just reloads that clients data?
Did you try the Power BI REST API?
https://learn.microsoft.com/en-us/rest/api/power-bi/pushdatasets/datasets_deleterows
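If you go that route, a minimal sketch with Python and requests might look like this. Note that Datasets - DeleteRows applies to push datasets, so whether it fits your setup is an assumption; the dataset ID, table name, and token are placeholders:

```python
import requests

ACCESS_TOKEN = "<aad-access-token>"   # placeholder
DATASET_ID = "<dataset-id>"           # placeholder
TABLE_NAME = "<table-name>"           # placeholder

url = (
    "https://api.powerbi.com/v1.0/myorg/"
    f"datasets/{DATASET_ID}/tables/{TABLE_NAME}/rows"
)

# DELETE removes all rows from the table, clearing the imported data
resp = requests.delete(url, headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})
resp.raise_for_status()
```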
The ideal situation would be to publish a .pbit instead of a .pbix (which contains data). Unfortunately, this is not allowed. If you can, vote for this or a similar idea.
https://ideas.powerbi.com/ideas/idea/?ideaid=b8928e24-3a1a-4a21-89d0-29fab9454a3c

Delay permanent deletion from google cloud storage bucket

I want to ensure deleted files have a window of recovery. I'd like to use the primitives offered by Google Cloud Storage so that I don't have to maintain the logic necessary to prevent files deleted in error from being irrecoverable.
I do not see a better way to achieve this than:
create a normal bucket for files that are displayed to the users
create a trash bucket for files pending permanent deletion, with a lifecycle rule that deletes objects N days after creation
upon a file deletion request, first copy the file to the trash bucket, then delete it from the normal bucket
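Roughly what I have in mind for that last step, as a sketch with the google-cloud-storage Python client (bucket names are placeholders, and the trash bucket's lifecycle rule is assumed to already exist):

```python
from google.cloud import storage

client = storage.Client()
normal_bucket = client.bucket("my-files")        # placeholder bucket name
trash_bucket = client.bucket("my-files-trash")   # placeholder bucket name

def move_to_trash(blob_name: str) -> None:
    """Move an object to the trash bucket; the lifecycle rule handles final deletion."""
    blob = normal_bucket.blob(blob_name)
    # Server-side copy into the trash bucket (no download/upload round trip)
    normal_bucket.copy_blob(blob, trash_bucket, new_name=blob_name)
    blob.delete()
```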
What is the "idiomatic" way of implementing delayed permanent deletion in GCP Cloud Storage?
NOTE: I am trying to avoid cron jobs or additional database interaction
NOTE: this is not a soft delete as the file is expected to eventually be permanently deleted without any trace/storage associated with it
You can keep all the files in the same bucket like this, assuming two things:
Each file is also referenced in a database that you're querying to build the UI.
You're able to write backend code to manage the bucket - people are not dealing with files directly with a Cloud Storage SDK.
It uses Cloud Tasks to schedule the deletion:
User asks for file deletion.
File is marked as "deleted" in the database, not actually deleted from bucket.
Use Cloud Tasks to schedule the actual deletion 5 days from now.
On schedule, the task triggers a function, which deletes the file and its database record.
Your UI will have to query the database in order to differentiate between deleted and trashed files.
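A rough sketch of step 3 with the google-cloud-tasks Python client; the project, location, queue name, and function URL are placeholders, and an HTTP-triggered function that performs the real deletion is assumed to exist:

```python
import datetime

from google.cloud import tasks_v2
from google.protobuf import timestamp_pb2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "file-deletions")  # placeholders

def schedule_deletion(file_id: str, days: int = 5) -> None:
    """Enqueue a task that calls the delete function `days` from now."""
    run_at = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(days=days)
    schedule_time = timestamp_pb2.Timestamp()
    schedule_time.FromDatetime(run_at)

    task = {
        "http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            # Placeholder URL: a function that deletes the file and its DB record
            "url": "https://REGION-PROJECT.cloudfunctions.net/delete-file",
            "body": file_id.encode(),
        },
        "schedule_time": schedule_time,
    }
    client.create_task(request={"parent": parent, "task": task})
```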

How do I move data from RDS of one AWS account to another account

We set up our web services and database on AWS a while back and the application is now in production. For some reason, we need to terminate the old AWS account and move everything under a newly created AWS account. The application and all the infrastructure are pretty straightforward; it is trickier for the data, though. The current database is still receiving lots of data on a daily basis, so it is best to migrate the data after we turn off the old application and switch on the new platform.
Both the source RDS and target RDS are Postgres. We have about 40 GB of data to transfer. There are three approaches I could think of, and they all have drawbacks.
Take a snapshot of the first RDS and restore it in the second one. The problem is I don't need to transfer all the data from source to destination; probably just records after 10/01 are enough. Also, a snapshot works best when restored into a freshly created, empty RDS instance. In our case, the new RDS will already be receiving data after the cutoff, and only after that will the data be transferred from the old account to the new one; otherwise we would lose data.
Dump data from the tables in the old RDS and restore it into the new RDS. This has the same problem as #1. Also, if I dump the data to a local machine and then restore from local, network speed is the bottleneck.
Export table data to CSV files and import them into the new RDS. The advantage is that this method allows picking and choosing, and some data cleaning as well. But it takes forever to export a big fact table to a local CSV file. Another problem is that for some of the tables I have surrogate row IDs which are serial (auto-incrementing); the row IDs in the exported CSV may conflict with existing data in the new RDS tables.
I wonder if there is a better way to do it. Maybe AWS has some ETL tool that does point-to-point direct transfer without using a local computer as the middle point.
In 2022 the simplest way to achieve this is AWS Database Migration Service (AWS DMS).
You can create a migration task, setting the original database as the source endpoint and the new database as the target endpoint.
Then create the task with the "Full load, ongoing replication" setting.
More details here: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html
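If you prefer scripting it over the console, a hedged sketch with boto3 might look like this; the source/target endpoints and the replication instance are assumed to already exist, and every ARN/identifier is a placeholder:

```python
import boto3

dms = boto3.client("dms")

dms.create_replication_task(
    ReplicationTaskIdentifier="old-account-to-new-account",
    SourceEndpointArn="arn:aws:dms:us-east-1:111111111111:endpoint:SOURCE",   # placeholder
    TargetEndpointArn="arn:aws:dms:us-east-1:222222222222:endpoint:TARGET",   # placeholder
    ReplicationInstanceArn="arn:aws:dms:us-east-1:222222222222:rep:INSTANCE", # placeholder
    # "Full load, ongoing replication" in the console
    MigrationType="full-load-and-cdc",
    # Replicate every table in every schema; narrow this down as needed
    TableMappings=(
        '{"rules": [{"rule-type": "selection", "rule-id": "1", "rule-name": "1", '
        '"object-locator": {"schema-name": "%", "table-name": "%"}, '
        '"rule-action": "include"}]}'
    ),
)
```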
I recently moved RDS data from one account to another using Bucardo (https://bucardo.org/). Please refer to the following blogs:
https://www.compose.com/articles/using-bucardo-5-3-to-migrate-a-live-postgresql-database/
https://bucardo.org/pipermail/bucardo-general/2017-February/002875.html
Though these don't cover migration between two RDS accounts exactly, they can help with the setup. You still need an intermediate point such as an EC2 instance where you configure Bucardo and migrate the data between accounts. If you are looking for more information, I am happy to help.
In short, take a manual snapshot of the source DB and restore it in the other account (https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ShareSnapshot.html); then, with Bucardo set up on the EC2 instance, you can start syncing the data using triggers, which will update the destination DB as new data comes into the source DB.
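For the snapshot-sharing step, a small boto3 sketch run from the old account (the snapshot identifier and the 12-digit account ID are placeholders):

```python
import boto3

rds = boto3.client("rds")

# Share a manual snapshot with the new account
rds.modify_db_snapshot_attribute(
    DBSnapshotIdentifier="pre-cutover-manual-snapshot",  # placeholder
    AttributeName="restore",
    ValuesToAdd=["123456789012"],  # the new AWS account ID (placeholder)
)
# In the new account: copy/restore the shared snapshot, then let Bucardo on an
# EC2 instance replicate the rows that arrive after the snapshot was taken.
```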

Azure Batch support for Data Lake Store Linked Service

I am using a Data Factory pipeline with a custom activity (configured to run on Azure Batch) that has a Data Lake Store input dataset and output dataset. The Data Lake Store linked service uses service-to-service auth (service principal) and works fine when used in a Copy activity through the Copy Wizard. But when used with a custom activity that tries to check if a file is present in the data lake, the activity fails with the error "Authorization is required". When using Azure Blob Storage as the input and output datasets, the same custom activity works fine.
It seems like an issue with Azure Batch (the compute node) not being able to authorize against Data Lake Store. Please help if you have solved the above-mentioned problem.
I had this exact same issue about 3 weeks ago. I feel your pain!
This is a Microsoft bug!
After much trial and error and many redeployments, I raised a support ticket with Microsoft, who confirmed that service principal authentication for Data Lake Store currently only works with copy activities, not with custom activities.
This is the official response I got on Monday 10th April.
The issue happens because of a bug: the custom activity's connector schema doesn't match the latest published connector schema. Actually, we noticed the issue on the custom activity and plan to fix & deploy to prod in the next 2 weeks.
Be aware that if you change your linked service back to use a session token etc. you'll also need to redeploy the pipelines that contain the custom activities. Otherwise you'll get another error, something like the following:
Access is forbidden, please check credentials and try again. Code: 'AuthenticationFailed' Message: 'Server failed to authenticate the request. Make sure the value of the Authorization header is formed correctly, including the signature.'
Hope this helps.

How to Sync Queryable Metadata with Cloud Blob Storage

I am trying to understand the general architecture and components needed to link metadata with blob objects stored in the cloud, such as Azure Blob Storage or AWS.
Consider an application which allows users to upload blob files to the cloud. With each file there would be a myriad of metadata describing the file, its cloud URL, and perhaps the emails of users the file is shared with.
In this case, the file gets saved to the cloud and the metadata into some type of database somewhere else. How would you go about doing this transactionally so that it is guaranteed that both the file and the metadata were saved? If one of the two fails, the application would need to notify the user so that another attempt can be made.
There's no built-in mechanism to span transactions across two disparate systems, such as Neo4j/MongoDB and Azure/AWS blob storage as you mentioned. This would be up to your app to manage, and how you go about that is really a matter of opinion/discussion.
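One common app-level pattern is a compensating action: write the blob first, then the metadata, and delete the blob if the metadata write fails. A rough sketch with the Azure Blob Storage Python SDK; the connection string, container name, and save_metadata() are placeholders for your own setup:

```python
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient.from_connection_string("<connection-string>")  # placeholder

def save_metadata(name: str, metadata: dict) -> None:
    # Placeholder: write the metadata record to your database (MongoDB, Neo4j, ...)
    raise NotImplementedError

def upload_with_metadata(container: str, name: str, data: bytes, metadata: dict) -> bool:
    blob = blob_service.get_blob_client(container, name)
    blob.upload_blob(data)                 # step 1: store the file
    try:
        save_metadata(name, metadata)      # step 2: store the metadata
    except Exception:
        blob.delete_blob()                 # compensate: don't leave an orphaned blob
        return False                       # caller notifies the user to retry
    return True
```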