Error connecting to Azure Data Lake in Azure Data Factory

I am trying to create a linked service in Azure Data Factory to an Azure Data Lake Storage Gen2 data store. When I test the connection for my linked service configuration, I get the following error message:
Error code: 24200
Details: ADLS Gen2 operation failed for: Storage operation '' on container 'testconnection' get failed with 'Operation returned an invalid status code 'Forbidden''. Possible root causes: (1). It's possible because some IP address ranges of Azure Data Factory are not allowed by your Azure Storage firewall settings. Azure Data Factory IP ranges please refer https://learn.microsoft.com/en-us/azure/data-factory/azure-integration-runtime-ip-addresses.
I have found a very similar question here, but I'm not using Managed Identity as my authentication method. Perhaps I should be using that method. How can I overcome this error?

I tried to create a linked service to my Azure Data Lake Storage account, and when I tested its connection, it gave me the same error shown above.
As indicated by the possible root causes in the error details, this occurs because of the Azure Data Lake Storage account's firewall settings.
Navigate to your data lake storage account and go to Networking -> Firewalls and virtual networks.
Here, when public network access is either disabled or enabled only from selected virtual networks and IP addresses, the linked service creation fails with the error above.
Change it to Enabled from all networks, save the changes, and try creating the linked service again.
Now, when you test the connection before creating the linked service, it will succeed and you can proceed to create it.
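If you want to double-check the firewall change outside of the linked service Test connection button, the following is a minimal sketch, assuming the azure-identity and azure-storage-file-datalake Python packages are installed and the signed-in identity already has a data-plane role (for example Storage Blob Data Reader) on the account; the account URL is a placeholder.

# Minimal sanity check against the ADLS Gen2 endpoint (not the linked service itself).
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://<your-account>.dfs.core.windows.net"  # placeholder

# DefaultAzureCredential picks up az login / environment / managed identity credentials.
credential = DefaultAzureCredential()
service_client = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=credential)

# This listing fails with 403 (Forbidden) while the firewall still blocks the
# caller, and succeeds once "Enabled from all networks" has been saved.
for file_system in service_client.list_file_systems():
    print(file_system.name)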
UPDATE:
If you want to keep public network access set to Enabled from selected virtual networks and IP addresses on your data lake storage account and still connect successfully via a linked service, you can use the following approach. First, create an integration runtime in your Azure Data Factory.
In Data Factory Studio, navigate to Manage -> Integration runtimes -> New and select Azure, Self-Hosted as the type of integration runtime. Select Azure in the next window, click Continue, and enter the details for the integration runtime.
In the Virtual network tab, enable the virtual network configuration and check the Interactive authoring checkbox.
Now continue to create the integration runtime. Once it is up and running, start creating the linked service for the data lake storage.
In Connect via integration runtime, select the IR created above. To complete the creation, we also need to create a managed private endpoint (you will be prompted to do so).
Click Create new, set the account selection method to From Azure subscription, select the data lake storage account you are creating the linked service to, and click Create.
Once you create this, a private endpoint request is sent to your data lake storage account. Open the storage account, navigate to Networking -> Private endpoint connections, where you will see a pending request, and approve it.
Once it is approved, you can successfully create the linked service while your data lake storage account allows access only from selected virtual networks and IP addresses.
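If you prefer to script that part, here is a hedged sketch of creating the same managed private endpoint with the azure-mgmt-datafactory Python package; the resource names, IDs, and the endpoint name adls-gen2-pe are placeholders I chose for illustration, and the approval step on the storage account is still required afterwards.

# Hedged sketch: create a managed private endpoint in the factory's managed VNet.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    ManagedPrivateEndpoint,
    ManagedPrivateEndpointResource,
)

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY_NAME = "<data-factory-name>"
STORAGE_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Storage/storageAccounts/<storage-account>"
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# "default" is the name ADF gives its managed virtual network; "dfs" is the
# private-link sub-resource (group ID) for ADLS Gen2.
adf_client.managed_private_endpoints.create_or_update(
    resource_group_name=RESOURCE_GROUP,
    factory_name=FACTORY_NAME,
    managed_virtual_network_name="default",
    managed_private_endpoint_name="adls-gen2-pe",
    managed_private_endpoint=ManagedPrivateEndpointResource(
        properties=ManagedPrivateEndpoint(
            private_link_resource_id=STORAGE_ID,
            group_id="dfs",
        )
    ),
)
# The endpoint still has to be approved on the storage account under
# Networking -> Private endpoint connections, as described above.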

The error occurs because of firewall and network access restrictions. One way to overcome it is to add your client IP to the firewall and network settings of your storage account: navigate to your data lake storage account, go to Networking -> Firewalls and virtual networks, and under the Firewall section click "Add your client IP address".
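If you'd rather make the same change from code instead of the portal, here is a hedged sketch using the azure-mgmt-storage Python package; the subscription, resource group, account name, and IP are placeholders.

# Hedged sketch: append your client IP to the storage firewall rules while
# keeping the default action as Deny.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import (
    IPRule,
    NetworkRuleSet,
    StorageAccountUpdateParameters,
)

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
ACCOUNT_NAME = "<storage-account>"
CLIENT_IP = "203.0.113.10"  # your public client IP (placeholder)

client = StorageManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Read the existing rules so the new IP is appended rather than replacing them.
account = client.storage_accounts.get_properties(RESOURCE_GROUP, ACCOUNT_NAME)
rules = account.network_rule_set or NetworkRuleSet(default_action="Deny")
rules.ip_rules = (rules.ip_rules or []) + [IPRule(ip_address_or_range=CLIENT_IP)]

client.storage_accounts.update(
    RESOURCE_GROUP,
    ACCOUNT_NAME,
    StorageAccountUpdateParameters(network_rule_set=rules),
)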

Related

How to use Azure Data Factory, Key Vaults and ADF Private Endpoints together

I've created a new ADF instance on Azure with Managed Virtual Network integration enabled.
I planned to connect to Azure Key Vault to retrieve credentials for my pipeline’s source and sink systems using Key Vault Private Endpoint. I was able to successfully create it using Azure Data Factory Studio. I have also created Azure Key Vault linked service.
However, when I try to configure other linked services for the source and destination systems, the only option available for retrieving credentials from Key Vault is the AKV linked service. I'm not able to select the related private endpoint anywhere.
Am I missing something?
Are there any additional configuration steps required? Is the scenario I've described possible at all?
Any help will be appreciated!
UPDATE: I compared two linked services, one where the managed network and private endpoint can be selected and another where I'm not able to set these options.
With Managed Virtual Network integration enabled, make sure to check which region you are using; unfortunately, the ADF managed virtual network is not supported in East Asia.
I tried this in my environment, and that option is not available for the Key Vault linked service either.
From the information I have gathered, even if you create a private endpoint for Key Vault, this column is always shown as blank; the Key Vault linked service only validates the URL format and doesn't perform any network operation.
As per the official documentation, if you want to use the managed private endpoint with a new linked service, create it for data store services such as Azure SQL Database or Azure Synapse Analytics rather than for Key Vault itself; a sketch of such a linked service follows the references below.
For your Reference:
Store credentials in Azure Key Vault - Azure Data Factory | Microsoft Docs
Azure Data Factory and Key Vault - Tech Talk Corner
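For completeness, here is a hedged Python sketch (azure-mgmt-datafactory package) of what such a data store linked service can look like when it runs on an IR inside the managed virtual network and pulls its connection string from the Key Vault linked service; the linked service names AzureKeyVaultLS and ManagedVnetIR, the secret name, and all resource names are placeholders.

# Hedged sketch: an Azure SQL linked service that resolves its connection
# string from a Key Vault linked service and runs on a managed VNet IR.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureKeyVaultSecretReference,
    AzureSqlDatabaseLinkedService,
    IntegrationRuntimeReference,
    LinkedServiceReference,
    LinkedServiceResource,
)

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY_NAME = "<data-factory-name>"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# The private endpoint / managed VNet settings belong to the IR this linked
# service uses; the Key Vault linked service is only referenced for the secret.
sql_ls = AzureSqlDatabaseLinkedService(
    connection_string=AzureKeyVaultSecretReference(
        store=LinkedServiceReference(
            type="LinkedServiceReference",
            reference_name="AzureKeyVaultLS",  # existing AKV linked service (placeholder)
        ),
        secret_name="sql-connection-string",
    ),
    connect_via=IntegrationRuntimeReference(
        type="IntegrationRuntimeReference",
        reference_name="ManagedVnetIR",  # an IR created inside the managed VNet
    ),
)

adf_client.linked_services.create_or_update(
    RESOURCE_GROUP,
    FACTORY_NAME,
    "AzureSqlViaKeyVault",
    LinkedServiceResource(properties=sql_ls),
)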

How to Manage IBM Cloud Key-Protect Instance from CLI when Private Network Only Policy is Applied?

In doing some testing of the IBM Cloud Security and Compliance items, specifically the CIS Benchmarks for Best Practices, one item I was non-compliant on was in Cloud Key Protect, for the goal "Check whether Key Protect is accessible only by using private endpoints".
My Key Protect instance was indeed set to "Public and Private", so I changed it to Private. This change now requires me to manage my Key Protect instance from the CLI.
When I try to even look at my Key-Protect instance policy from the CLI I receive the following error:
ibmcloud kp instance -i my_instance_id policies
Retrieving policy details for instance: my_instance_id...
Error while getting instance policy: kp.Error: correlation_id='cc54f61d-4424-4c72-91aa-d2f6bc20be68', msg='Unauthorized: The user does not have access to the specified resource'
FAILED
Unauthorized: The user does not have access to the specified resource
Correlation-ID:cc54f61d-4424-4c72-91aa-d2f6bc20be68
I'm confused - I am running the CLI logged in as the tenant admin with an access policy of All resources in account (including future IAM enabled services).
What am I doing wrong here?
Private endpoints are only accessible from within IBM Cloud. If you connect from the public internet, access should be blocked.
There are multiple ways to work with such a policy in place. One is to deploy (a VPC with) a virtual machine on a private network, then connect to it with a VPN or Direct Link. Thus, your resources are not accessible from the public internet, but only through private connectivity. You could continue to use the IBM Cloud CLI, but set it to use private endpoints.
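As an illustration of the last point, here is a hedged Python sketch of calling the Key Protect instance-policies API over its private endpoint from a host inside IBM Cloud; the region, instance ID, and API key are placeholders, and you should verify the private hostname and API path against the Key Protect API reference for your region.

# Hedged sketch: read Key Protect instance policies via the private REST endpoint,
# run from a host that has private connectivity (e.g. a VPC VM over VPN/Direct Link).
import requests

REGION = "us-south"                 # placeholder
INSTANCE_ID = "my_instance_id"      # Key Protect service instance GUID (placeholder)
API_KEY = "<ibmcloud-api-key>"      # placeholder

# Exchange the API key for an IAM bearer token.
iam = requests.post(
    "https://iam.cloud.ibm.com/identity/token",
    data={
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": API_KEY,
    },
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)
iam.raise_for_status()
token = iam.json()["access_token"]

# Call the private endpoint; this hostname only answers from inside IBM Cloud.
resp = requests.get(
    f"https://private.{REGION}.kms.cloud.ibm.com/api/v2/instance/policies",
    headers={
        "Authorization": f"Bearer {token}",
        "Bluemix-Instance": INSTANCE_ID,
    },
)
resp.raise_for_status()
print(resp.json())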

Linked Service with self-hosted integration runtime is not supported in data flow in Azure Data Factory

Step to reproduce:
I created a Copy Data activity first in the pipeline to simply transfer CSV files from an Azure VM to Azure Blob storage. I always use IRPOC1 as the connection via integration runtime and connect to my Blob storage using a SAS URI and SAS token.
After validating and running my first Copy Data activity, I successfully had the CSV files transferred from my VM to Blob storage.
I tried to add a new Data Flow after the Copy Data activity
In my Data Flow, my source is the Blob storage containing the CSV files transferred from VM, my Sink is my Azure SQL Database with successful connection
However, when I ran validation, I got the error message on my Data Flow Source:
Linked Service with self-hosted integration runtime is not supported in data flow.
I saw someone reply on a Microsoft Azure documentation issue on GitHub that I need to use Copy Data to transfer the data to Blob storage first and then use this blob as the source for the data flow. This is what I did, but I still get the same error. Could you please let me know how I can fix this?
The Data Flow source dataset must use a Linked Service that uses an Azure IR, not a self-hosted IR.
Go to the dataset in your data flow Source, click "Open". In the dataset page, click "Edit" next to Linked Service.
In the Linked Service dialog, make sure you are using an Azure Integration Runtime, not a Self-hosted IR.
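If you want to fix this from code rather than the UI, the following is a hedged sketch (azure-mgmt-datafactory Python package) of defining a Blob linked service for the Data Flow source that resolves through the default Azure IR instead of the self-hosted one; the factory details, linked service name, and SAS URI are placeholders.

# Hedged sketch: a Blob linked service that uses the default Azure IR
# ("AutoResolveIntegrationRuntime") so it can be used by a Data Flow source.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobStorageLinkedService,
    IntegrationRuntimeReference,
    LinkedServiceResource,
)

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY_NAME = "<data-factory-name>"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

blob_ls = AzureBlobStorageLinkedService(
    # The SAS URI could also be supplied as a SecureString or a Key Vault reference.
    sas_uri="https://<account>.blob.core.windows.net/<container>?<sas-token>",
    connect_via=IntegrationRuntimeReference(
        type="IntegrationRuntimeReference",
        reference_name="AutoResolveIntegrationRuntime",  # Azure IR, not the self-hosted IRPOC1
    ),
)

adf_client.linked_services.create_or_update(
    RESOURCE_GROUP,
    FACTORY_NAME,
    "BlobStorageForDataFlow",
    LinkedServiceResource(properties=blob_ls),
)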

Use only a domain and disable https://storage.googleapis.com url access

I am a newbie at cloud servers, and I've opened a Google Cloud Storage bucket to host image files. I've verified my domain and configured the bucket so images are viewable via my domain. The problem is that the same file is accessible both via my domain, example.com/images/tiny.png, and via storage.googleapis.com/example.com/images/tiny.png. Is there any solution to disable access via storage.googleapis.com and use only my domain?
Google Cloud Platform Support Version:
NOTE: This is the reply from Google Cloud Platform Support when contacted via email...
I understand that you have set up a domain name for one of your Cloud Storage buckets and you want to make sure only URLs starting with your domain name have access to this bucket.
I am afraid that this is not possible because of how Cloud Storage permission works.
Making a Cloud Storage bucket publicly readable also gives each of its files a public link. And currently this public link can’t be disabled.
A workaround would be to implement a proxy program and run it on a Compute Engine virtual machine. This VM will need a static external IP so that you can map your domain to it. The proxy program will be in charge of returning the requested file from a predefined Cloud Storage bucket while the bucket itself remains inaccessible to the public (a minimal sketch of such a proxy follows the links below).
You may find these documents helpful if you are interested in this workaround:
1. Quick start to set up a Linux VM: https://cloud.google.com/compute/docs/quickstart-linux
2. Python API for accessing Cloud Storage files: https://pypi.org/project/google-cloud-storage/
3. How to download service account keys to grant a program access to a set of services: https://cloud.google.com/iam/docs/creating-managing-service-account-keys
4. Pricing calculator for getting a picture of how much a VM may cost: https://cloud.google.com/products/calculator/
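To make the workaround above more concrete, here is a minimal sketch of such a proxy, assuming a small Flask app on the VM and a downloaded service account key with read access to the non-public bucket; the bucket name, key file, and route layout are placeholders.

# Hedged sketch of the proxy idea: serve objects from a private bucket.
import io

from flask import Flask, abort, send_file
from google.cloud import storage

BUCKET_NAME = "example.com"  # the bucket mapped to your domain (placeholder)

app = Flask(__name__)

# Uses the service account key downloaded in step 3 above.
client = storage.Client.from_service_account_json("service-account.json")
bucket = client.bucket(BUCKET_NAME)

@app.route("/<path:object_path>")
def serve(object_path):
    # Fetch the object from the private bucket and stream it back to the caller.
    blob = bucket.blob(object_path)
    if not blob.exists():
        abort(404)
    data = blob.download_as_bytes()
    return send_file(
        io.BytesIO(data),
        download_name=object_path,
        mimetype=blob.content_type or "application/octet-stream",
    )

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=80)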
My Version:
It seems the solution to this question is really simple: just mount the Google Cloud Storage bucket on the VM instance with Cloud Storage FUSE.
After mounting, private files from GCS can be accessed through the VM's IP address; the bucket effectively acts like a local directory.
The detailed documentation about how to set up FUSE in Google Cloud is here.
There is, but it requires you to do more work.
Your current setup works because you've made access to the GCS bucket (example.com) public and you're DNS-aliasing it from your domain.
An alternative approach would be to limit access to the GCS bucket to one (or possibly several) accounts and then run a web server that uses one of those accounts to access your image files. You could then either permit access to your web server to anyone or limit that access as well.
More work for you (and possibly more cost), but more control.
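To make the first half of this concrete, here is a hedged sketch (google-cloud-storage Python client) that removes the public grants from the bucket and gives read access only to the service account the web server will use; the bucket name and service account are placeholders.

# Hedged sketch: lock the bucket down to a single service account.
from google.cloud import storage

BUCKET_NAME = "example.com"  # placeholder
WEB_SERVER_SA = "serviceAccount:web-proxy@my-project.iam.gserviceaccount.com"  # placeholder

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

policy = bucket.get_iam_policy(requested_policy_version=3)

# Drop any public grants (allUsers / allAuthenticatedUsers) ...
for binding in policy.bindings:
    binding["members"] = {
        member
        for member in binding["members"]
        if member not in ("allUsers", "allAuthenticatedUsers")
    }

# ... and let only the web server's service account read objects.
policy.bindings.append(
    {"role": "roles/storage.objectViewer", "members": {WEB_SERVER_SA}}
)
bucket.set_iam_policy(policy)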

Data Transfer between Google Storage different Service Accounts

I have two Google service credentials and a bucket on each account. I have to transfer files from one bucket to another. How can I do this programmatically?
Can I achieve this with two Storage client objects, or by using the Cloud Storage Transfer Service?
Yes, with the Storage Transfer Service you can create a transfer job and send the data to a destination bucket (in another project). Keep in mind that it is documented that:
To access the data source and the data sink, this service account must
have source permissions and sink permissions.
This means that you can't use two different service accounts; you will need to grant access to only one of the two service accounts you have.
If you want to transfer files from one bucket to another programmatically, you must first grant permission to the service account associated with the Storage Transfer Service so it can access the data sink (destination bucket); please follow these steps.
Please note that if you are not creating the transfer job in the same project where the source bucket is located, then you must grant permissions to access it.
With the Storage Transfer Service you can create a transfer job programmatically in Java or Python; the examples include creating the transfer job and checking the transfer operation status. Full code examples can be found for both Java and Python.
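For reference, here is a hedged Python sketch of a one-time bucket-to-bucket transfer job with the google-cloud-storage-transfer package; the project and bucket names are placeholders, and it assumes the Storage Transfer Service agent has already been granted the source and sink permissions described above.

# Hedged sketch: create a one-time GCS-to-GCS transfer job.
from datetime import datetime, timezone

from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()

today = datetime.now(timezone.utc)
run_date = {"year": today.year, "month": today.month, "day": today.day}

job = client.create_transfer_job(
    storage_transfer.CreateTransferJobRequest(
        {
            "transfer_job": {
                "project_id": "my-project",  # placeholder
                "description": "Copy source bucket to destination bucket",
                "status": storage_transfer.TransferJob.Status.ENABLED,
                # Same start and end date means the job runs once.
                "schedule": {
                    "schedule_start_date": run_date,
                    "schedule_end_date": run_date,
                },
                "transfer_spec": {
                    "gcs_data_source": {"bucket_name": "source-bucket"},
                    "gcs_data_sink": {"bucket_name": "destination-bucket"},
                },
            }
        }
    )
)
print(f"Created transfer job: {job.name}")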