Azure Batch support for Data Lake Store Linked Service - azure-data-factory

I am using a Data Factory pipeline with a custom activity (configured to run on Azure Batch) that has a Data Lake Store input dataset and output dataset. The Data Lake Store linked service uses service-to-service auth (service principal) and works fine when used in a Copy activity created through the Copy Wizard. But when it is used with a custom activity that tries to check whether a file is present in the data lake, the activity fails with the error "Authorization is required". When I use an Azure Blob Store as the input and output datasets instead, the same custom activity works fine.
It seems like an issue with the Azure Batch compute node not being able to authorize against Data Lake Store. Please help if you have solved this problem.
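For reference, such an existence check against Data Lake Store with service-principal credentials looks roughly like the following sketch (the tenant ID, client ID, secret, account name, and file path are all placeholders):

from azure.datalake.store import core, lib

# Acquire an AAD token for the service principal (placeholder credentials).
token = lib.auth(
    tenant_id='<tenant-id>',
    client_id='<service-principal-app-id>',
    client_secret='<service-principal-key>',
)

# Connect to the Data Lake Store account and check whether the file exists.
adl = core.AzureDLFileSystem(token, store_name='<datalake-account-name>')
print(adl.exists('/input/myfile.csv'))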

I had this exact same issue about 3 weeks ago. I feel your pain!
This is a Microsoft bug!
After much trial and error and many redeployments I raised a support ticket with Microsoft, who confirmed that service principal authentication for Data Lake Store currently only works with copy activities, not with custom activities.
This is the official response I got on Monday 10th April.
The issue happen because of a bug that custom activity’s connector schema doesn’t match the latest published connector schema. Actually, we notice the issue on custom activity and have plan to fix & deploy to prod in next 2 weeks.
Be aware that if you change your linked service back to use a session token etc. you'll also need to redeploy the pipelines that contain the custom activities. Otherwise you'll get another error, something like the following...
Access is forbidden, please check credentials and try again. Code: 'AuthenticationFailed' Message: 'Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.'
Hope this helps.

Related

How to clear the data in a Power BI Report when publishing it to multiple clients via DevOps

We are building a generic Power BI reporting suite for a product. Every time we commit the PBIX file to source control, our DevOps project publishes the report to multiple workspaces, clears the SQL connection, and reattaches it to the SQL database for each particular client. We then trigger a dataset refresh so that the report loads the data from its own SQL database.
The problem is that the PBIX file committed to Git contains the imported data from the LAST SQL data source it was connected to (usually our dev SQL DB). When the report and dataset are then published, each client will see this imported dev data in their report for a period of 20 minutes or so until the dataset has refreshed, which is a security risk and confusing for the user.
I found this workaround using Power Query as an example; however, I am really hoping for a more robust solution.
https://community.powerbi.com/t5/Desktop/Another-way-to-clear-fetched-data-from-PBIX-files/m-p/686627
How can we "clear" the imported data in the PBIX file so that when we publish to each client it is completely empty and just reloads that client's data?
Did you try the Power BI REST API?
https://learn.microsoft.com/en-us/rest/api/power-bi/pushdatasets/datasets_deleterows
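For what it's worth, calling that endpoint looks roughly like the sketch below (the bearer token, dataset ID, and table name are placeholders); note that DeleteRows only applies to push datasets:

import requests

# Placeholders: an AAD token with dataset write permission, plus the IDs.
access_token = '<aad-access-token>'
dataset_id = '<dataset-id>'
table_name = '<table-name>'

# Delete all rows from one table of a push dataset.
url = (f'https://api.powerbi.com/v1.0/myorg/datasets/'
       f'{dataset_id}/tables/{table_name}/rows')
resp = requests.delete(url, headers={'Authorization': f'Bearer {access_token}'})
resp.raise_for_status()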
The ideal situation would be to be able to publish a .pbit (template) instead of a .pbix (which contains data). Unfortunately, this is not allowed. If you can, vote for this or a similar idea:
https://ideas.powerbi.com/ideas/idea/?ideaid=b8928e24-3a1a-4a21-89d0-29fab9454a3c

How does Azure Purview perform Data Lineage in Azure Data Factory when there are multiple Copy Data Activities on the same Source?

My particular scenario is this:
Data Factory Pipeline
I have a .txt file in Azure Blob Storage.
I copy this file from Blob to Azure SQLDB.
I copy the same file to an archive location in the same blob container.
I then delete the file after it is archived.
When I triggered the pipeline in Azure Data Factory, Purview gave me a data lineage that only showed the archive copy activity and never showed the Blob to Azure SQLDB copy activity.
Refer to this screenshot for lineage: Purview Data Lineage
When I navigate to the Azure SQLDB destination in Purview, it says no data lineage is available for this asset.
Here is what I have done or thought could be the reason:
Maybe the copy activities need to be in different pipelines. I tested this and the same result occurred.
Maybe because I deleted the file, it's not picking up the Blob source to Azure SQLDB copy activity. I will be testing this, but I think it's unlikely, since it did pick up the Blob source to Blob archive destination copy activity.
Maybe it only picks up the last copy activity for a given source, and not all of them. I tested this and it did not change the data lineage. It is possible that I need to do something in Azure Purview to "reset" the data lineage, but I think it uses the last pipeline run for the source; I noticed it did update the data lineage when I separated the pipeline into two pipelines (one for loading Azure SQLDB, and the other for archiving the blob file).
I'm fairly stuck on this one... I will completely remove the archiving and see what happens, but according to all of the Microsoft Documentation, Data Lineage for Azure Blob and Azure SQLDB is supported, so this should be working. If anyone has answers or ideas, I would love to hear them.
Update: My newest theory is that there is a time lag between when you run a pipeline and when the data lineage is refreshed in Purview... I am going to try disconnecting the Data Factory and reconnecting.
Update #2: Deleting the Data Factory connection and reconnecting did nothing, from what I can tell. I have been playing with how to delete assets via the REST API, which is apparently the only way to delete assets/relationships in Purview... I think my next step will be to delete the Purview account and storage.
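For reference, deleting a single asset by GUID through Purview's Atlas-compatible REST API looks roughly like this sketch (the account name and entity GUID are placeholders):

import requests
from azure.identity import DefaultAzureCredential

account = '<purview-account-name>'   # placeholder
guid = '<entity-guid>'               # placeholder

# Token scoped to the Purview resource.
token = DefaultAzureCredential().get_token('https://purview.azure.net/.default').token

# Delete the catalog entity (asset) with the given GUID.
url = f'https://{account}.purview.azure.com/catalog/api/atlas/v2/entity/guid/{guid}'
resp = requests.delete(url, headers={'Authorization': f'Bearer {token}'})
resp.raise_for_status()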
Update #3: Even after spinning up a new Purview account, the lineage still fails to show Blob to Azure SQLDB. I am wondering if it's because of the ForEach logic I have, but that doesn't make sense because the archive copy activity was inside the ForEach as well. I'm at a loss. I have other copy activities from Blob to Azure SQLDB that work, but not this one.
Thanks
After a LOT of testing, I believe the problem is that Purview does not know how to handle copy activities that include additional columns.
Does NOT work: with additional columns
Works: without additional columns
The ONLY difference was that one had additional columns mapped and the other did not. A slight design gap...
I have created this Azure Purview Feature Request! https://feedback.azure.com/forums/932437-azure-purview/suggestions/42357196-allow-data-lineage-to-be-performed-on-azure-data-f
Please vote for this so it can be implemented in a future release.

Cannot create a batch pipeline to get data from ZohoCRM with HTTP plugin 1.2.1 to BigQuery. Returns Spark Program 'phase-1' failed

This is my first post here; I'm new to Data Fusion and have little to no coding skills.
I want to get data from Zoho CRM into BigQuery, with each Zoho CRM module (e.g. Accounts, Contacts, ...) becoming a separate table in BigQuery.
To connect to Zoho CRM I obtained a code, token, refresh token and everything else needed, as described here: https://www.zoho.com/crm/developer/docs/api/v2/get-records.html. I then ran a successful get-records request via Postman, as described there, and it returned the records from the Zoho CRM Accounts module as a JSON file.
I thought it would all be fine and set the parameters in Data Fusion (DataFusion_settings_1 and DataFusion_settings_2); it validated fine. Then I previewed and ran the pipeline without deploying it. It failed with the following info from the logs (logs_screenshot). I tried manually entering a few fields in the schema when the format was JSON, and I tried changing the format to CSV; neither worked. I tried switching Verify HTTPS Trust Certificates on and off. It did not help.
I'd be really thankful for some help. Thanks.
Update, 2020-12-03
I got in touch with a Google Cloud account manager, who took my question to their engineers; here is the info:
The HTTP plugin can be used to "fetch Atom or RSS feeds regularly, or to fetch the status of an external system"; it does not seem to be designed for APIs.
At the moment, a more suitable tool for data collected via APIs is Dataflow: https://cloud.google.com/dataflow
"Google Cloud Dataflow is used as the primary ETL mechanism, extracting the data from the API Endpoints specified by the customer, which is then transformed into the required format and pushed into BigQuery, Cloud Storage and Pub/Sub."
https://www.onixnet.com/insights/gcp-101-an-introduction-to-google-cloud-platform
So in the next few weeks I'll be looking at Dataflow.
Can you please attach the complete logs of the preview run? Make sure to redact any PII. Also, which version of CDF are you using? Is the CDF instance private or public?
Thanks and Regards,
Sagar
Did you end up using Dataflow?
I am also experiencing the same issue with the HTTP plugin, but my temporary workaround was to use Cloud Scheduler to periodically trigger a Cloud Function that fetches my data from the API and exports it as JSON to GCS, which can then be accessed by Data Fusion (see the sketch below).
My solution is of course non-ideal, so I am still looking for a way to use the Data Fusion HTTP plugin. I was able to make it work for fetching sample data from public API endpoints, but for a reason still unknown to me I can't get it to work for my actual API.
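As a rough, hypothetical sketch of that workaround (the endpoint, auth header, bucket name, and object path are assumptions): an HTTP-triggered Cloud Function, invoked by Cloud Scheduler, that calls the source API and writes the JSON response to a GCS bucket.

import json
import requests
from google.cloud import storage

API_URL = 'https://www.zohoapis.com/crm/v2/Accounts'   # example endpoint
BUCKET = '<landing-bucket>'                             # placeholder

def fetch_to_gcs(request):
    # Call the source API with a previously obtained OAuth access token.
    resp = requests.get(
        API_URL,
        headers={'Authorization': 'Zoho-oauthtoken <access-token>'},
    )
    resp.raise_for_status()

    # Write the raw JSON payload to the landing bucket for Data Fusion to read.
    storage.Client().bucket(BUCKET).blob('zoho/accounts.json').upload_from_string(
        json.dumps(resp.json()), content_type='application/json'
    )
    return 'ok'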

How can I troubleshoot the 'No available gateway error' in Power BI?

I am having trouble with Power BI dataset refresh via the REST API.
I get this error whenever I try to execute the dataset refresh API:
Last refresh failed: ...
There is no available gateway.
I'm testing on two accounts, and this happens only on one of them.
What is more confusing is that the storage I'm using is cloud based (Azure Data Lake), so it doesn't need a gateway connection. And when I refresh the datasets manually, it works.
When I get the refresh history for further investigation, I get this:
serviceExceptionJson={"errorCode":"DMTS_MonikerWithUnboundDataSources"}
I could use any help.
I have found a workaround, although I'm not very convinced I've found the source of the problem (the why is still very blurry to me).
First, for more context, this is what I'm doing (using the REST APIs); a rough sketch of these calls follows the list:
1- Upload a dataset to the workspace (import a .pbix, skipping the report; I only need the parameters and the Power Query within it)
2- Update the parameters
3- Refresh the dataset
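A minimal sketch of those three calls with Python's requests (the workspace and dataset IDs, token, file path, and parameter name are all placeholders):

import requests

BASE = 'https://api.powerbi.com/v1.0/myorg'
group_id = '<workspace-id>'
headers = {'Authorization': 'Bearer <aad-access-token>'}

# 1 - Import the .pbix into the workspace, skipping the report.
with open('report.pbix', 'rb') as f:
    requests.post(
        f'{BASE}/groups/{group_id}/imports'
        '?datasetDisplayName=MyDataset&skipReport=true',
        headers=headers, files={'file': f},
    ).raise_for_status()

# 2 - Update the (hypothetical) data lake URL parameter on the new dataset.
dataset_id = '<dataset-id>'   # looked up once the import completes
requests.post(
    f'{BASE}/groups/{group_id}/datasets/{dataset_id}/Default.UpdateParameters',
    headers=headers,
    json={'updateDetails': [{'name': 'DataLakeUrl', 'newValue': '<client-url>'}]},
).raise_for_status()

# 3 - Trigger the dataset refresh.
requests.post(
    f'{BASE}/groups/{group_id}/datasets/{dataset_id}/refreshes',
    headers=headers,
).raise_for_status()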
The thing is, I have a parameter for the URL of the data lake data. There are two possible values of the URL, depending on the use case/user.
So what I did is remove that parameter and make two different .pbix files, each with its specific URL. I then import the appropriate .pbix depending on the use case.
I still don't understand why I get a gateway error rather than an "error while processing data" message (given that it seems to be a data access issue).

How to automatically create a new file based on an existing file within Google Cloud Storage

This is the first time I've used Google Cloud, so I might be asking the question in the wrong place.
 
An information provider uploads a new file to Google Cloud Storage every day.
The file contains the information for all my clients/departments.
I have to sort through the information and create new file(s) containing the relevant information for each department in my company, so that everyone only gets the information relevant to them (for security).
I can't figure out what steps I need to follow to complete the task.
Can you help me?
You want to have a process that starts automatically and subsequently generates a new file once you upload something to Google Cloud Storage.
The easiest way to handle this is by using Object Change Notifications. You can set up Object Change Notifications per bucket, and this will send a POST request to a URL that you define.
You can then easily set up a server (or run it on App Engine) that executes an action based on the POST request it receives.
There is an even simpler option (although still in alpha) named Cloud Functions. Cloud Functions is a serverless service that provides event-based microservices (e.g. 'do this' when a new file is uploaded to GCS). This means you only have to write the code that defines what needs to happen when a new file is uploaded, and Cloud Functions will take care of executing that code whenever you upload a file to GCS. See this tutorial on using Cloud Functions with Google Cloud Storage.
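As a rough, hypothetical sketch of that approach (the output bucket name, the 'department' column, and the CSV layout are all assumptions): a Cloud Function triggered by each new object in the upload bucket reads the file, groups the rows by department, and writes one file per department to a separate bucket that each department can be granted access to.

import csv
import io
from collections import defaultdict
from google.cloud import storage

OUTPUT_BUCKET = '<per-department-bucket>'   # placeholder

def split_by_department(event, context):
    # Background function triggered when an object is finalized in the upload bucket.
    client = storage.Client()
    text = client.bucket(event['bucket']).blob(event['name']).download_as_text()

    # Group rows by the (assumed) 'department' column.
    reader = csv.DictReader(io.StringIO(text))
    rows_by_dept = defaultdict(list)
    for row in reader:
        rows_by_dept[row['department']].append(row)

    # Write one file per department containing only that department's rows.
    out_bucket = client.bucket(OUTPUT_BUCKET)
    for dept, rows in rows_by_dept.items():
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=reader.fieldnames)
        writer.writeheader()
        writer.writerows(rows)
        out_bucket.blob(f"{dept}/{event['name']}").upload_from_string(
            buf.getvalue(), content_type='text/csv'
        )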