How to copy files and folders from one ADLS to another on a different subscription? - azure-data-factory

I need to be able to copy files and folders from one Data Lake to another Data Lake on a different subscription. I have both the auth token and the secret key.
I've tried different solutions, including:
https://medium.com/azure-data-lake/connecting-your-own-hadoop-or-spark-to-azure-data-lake-store-93d426d6a5f4
which involves Hadoop, but it didn't work across two different subscriptions because core-site.xml only accepts one subscription.
AdlCopy didn't work either, nor did Data Factory.
Any ideas? Can anyone point me in the right direction?

Azure Data Factory supports this scenario. You need to create two AzureDataLakeStoreLinkedService definitions, each with the correct corresponding subscription and resource group (they don't necessarily have to be the same as the data factory's subscription), plus the service principal credentials to access ADLS.
If this doesn't answer your question, could you tell us more about your scenario? I don't understand this part: "I'm trying to add both secret key and tokens of the two different subscriptions in the core-site.xml". What do you mean by that?
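For illustration, here is a rough sketch of that setup using the azure-mgmt-datafactory Python SDK; the linked service properties mirror what the JSON definition would contain. All tenant/subscription IDs, names, and ADLS URIs below are placeholders, not values from this thread.

```python
# Sketch: register two ADLS linked services (source and sink) in one data
# factory, each pointing at a store in its own subscription/resource group.
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureDataLakeStoreLinkedService,
    LinkedServiceResource,
    SecureString,
)

# Placeholder service principal; it must have access to both ADLS accounts.
credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<sp-client-id>",
    client_secret="<sp-client-secret>",
)
adf_client = DataFactoryManagementClient(credential, "<factory-subscription-id>")

def adls_linked_service(adls_uri, subscription_id, resource_group):
    """Build an ADLS linked service for a store in any subscription of the tenant."""
    return LinkedServiceResource(
        properties=AzureDataLakeStoreLinkedService(
            data_lake_store_uri=adls_uri,
            service_principal_id="<sp-client-id>",
            service_principal_key=SecureString(value="<sp-client-secret>"),
            tenant="<tenant-id>",
            subscription_id=subscription_id,      # the ADLS subscription, not the factory's
            resource_group_name=resource_group,   # the ADLS resource group
        )
    )

rg, factory = "<factory-resource-group>", "<factory-name>"
adf_client.linked_services.create_or_update(
    rg, factory, "SourceADLS",
    adls_linked_service("adl://source-store.azuredatalakestore.net/", "<sub-A>", "<rg-A>"),
)
adf_client.linked_services.create_or_update(
    rg, factory, "SinkADLS",
    adls_linked_service("adl://sink-store.azuredatalakestore.net/", "<sub-B>", "<rg-B>"),
)
```

A Copy activity can then reference datasets bound to SourceADLS and SinkADLS, so the cross-subscription hop is handled entirely by the linked service definitions.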

Related

Pipeline run check from different ADF

We need to make sure that the scenarios below work:
same subscription with different Azure Data Factory
different subscription with different Azure Data Factory
Please provide the pros and cons of each.
As long as all the subscriptions are within the same tenant, it doesn't matter whether the ADF pipeline is in the same subscription or not. The process for getting the run status remains the same.
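As a minimal sketch of that "same process either way" point, assuming a service principal with Reader access on the target factory (all IDs below are placeholders):

```python
# Sketch: poll a pipeline run's status in another factory, which may live in a
# different subscription of the same tenant.
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# One credential for the tenant; only the subscription id given to the client changes.
credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<sp-client-id>",
    client_secret="<sp-client-secret>",
)

def get_run_status(subscription_id, resource_group, factory_name, run_id):
    """Return the status (Queued / InProgress / Succeeded / Failed / ...) of a pipeline run."""
    client = DataFactoryManagementClient(credential, subscription_id)
    run = client.pipeline_runs.get(resource_group, factory_name, run_id)
    return run.status

# Identical call whether the target factory shares our subscription or not.
print(get_run_status("<other-subscription-id>", "<rg>", "<other-factory>", "<run-id>"))
```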

how to plan google cloud storage bucket creation when working with users

I'm trying to figure out if anyone can offer advice about bucket creation for an app whose users each have an album of photos. I was initially thinking of creating a single bucket and prefixing the file name with the user id, since Google Cloud Storage doesn't recognize subdirectories, like so: /bucket-name/user-id1/file.png
Alternatively, I was considering creating a bucket and naming it by user id like so: /user-id1-which-is-also-bucket-name/file.png
I was wondering what I should consider in terms of cost and organization when setting up my google cloud storage. Thank you!
There is no difference in terms of cost. In terms of organization, it's different:
For deletion, it's simpler to delete a whole bucket than a folder inside a single shared bucket.
For performance, sharding is better if you have separate buckets (you're less likely to create a hotspot).
From a billing perspective, you can add labels to the buckets and get them in the billing export to BigQuery. You can then know the cost of each user's bucket and possibly rebill it to them.
The biggest advantage of the one-bucket-per-user model is security. You can grant a user access at the bucket level (if the users access the bucket directly rather than through a backend service), without using the legacy (and nearly deprecated) per-object ACLs. In addition, if you use ACLs, you can't set an ACL per folder; ACLs are per object. So every time you add an object to the single shared bucket, you have to set the ACL on it, which is harder to achieve.
IMO, 1 bucket per user is the best model.
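To make the labels and bucket-level IAM points concrete, here is a sketch with the google-cloud-storage Python client. The bucket naming convention, label keys, and location are illustrative assumptions, not requirements.

```python
# Sketch: one bucket per user, labelled for the BigQuery billing export and
# shared with that user via bucket-level IAM (no per-object ACLs needed).
from google.cloud import storage

client = storage.Client()

def create_user_bucket(user_id, user_email):
    # Bucket names are globally unique; the prefix here is just a naming convention.
    bucket = client.bucket(f"myapp-photos-{user_id}")
    bucket.labels = {"app": "myapp", "user-id": user_id}   # visible in the billing export
    bucket = client.create_bucket(bucket, location="us-central1")

    # Grant the user read access on the whole bucket instead of legacy object ACLs.
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append({
        "role": "roles/storage.objectViewer",
        "members": {f"user:{user_email}"},
    })
    bucket.set_iam_policy(policy)
    return bucket

def delete_user(user_id):
    # Removing a user is just removing their bucket (force=True deletes its objects first).
    client.bucket(f"myapp-photos-{user_id}").delete(force=True)
```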

How to prevent creating too many secrets in HashiCorp Vault

I am looking for a better solution for secrets rotation and found that Vault dynamic secrets are a good one. By enabling a secrets engine, say for a database, applications/services can lease dynamic secrets.
I noticed that every time an application leases a database secret, Vault creates a new user/account in the DB. I understand each application/service needs to be a good citizen and use the secret within its lease time. However, in a microservices environment, an implementation bug could cause services to request too many dynamic secrets, which in turn creates too many accounts in the DB.
Is there any way to prevent creating too many accounts? I am just worried that too many accounts may cause problems in the DB.
You could go down the static roles route, which creates one role with a fixed username; Vault then just rotates that password whenever it needs to be rotated.
Here are some docs to get you started:
https://www.vaultproject.io/api/secret/databases#create-static-role
https://www.vaultproject.io/docs/secrets/databases#static-roles
Also, a warning from the website:
Not all database types support static roles at this time. Please
consult the specific database documentation on the left navigation or
the table below under Database Capabilities to see if a given database
backend supports static roles.
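A minimal sketch of the static-role approach with the hvac Python client against the API paths linked above, assuming the database secrets engine is mounted at database/, a connection named my-postgres is already configured, and the DB account my_app_user already exists (static roles manage a pre-existing account's password):

```python
# Sketch: define a static role (one fixed DB account whose password Vault
# rotates) and read its current credentials.
import hvac

client = hvac.Client(url="https://vault.example.com:8200", token="<vault-token>")

# One fixed account, rotated every 24h, instead of a new account per lease.
client.write(
    "database/static-roles/my-app",
    db_name="my-postgres",        # existing database connection config (assumed)
    username="my_app_user",       # pre-created DB account that Vault will manage (assumed)
    rotation_period="24h",
)

# Every service reads the same account; only the password changes on rotation.
creds = client.read("database/static-creds/my-app")["data"]
print(creds["username"], creds["password"])
```

With this model, a buggy service that reads credentials too often only re-fetches the same account rather than minting new ones.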

Azure Batch support for Data Lake Store Linked Service

I am using a Data Factory pipeline with a custom activity (configured to run on Azure Batch) that has a Data Lake Store input dataset and output dataset. The Data Lake Store linked service uses service-to-service auth (service principal) and works fine when used in a Copy activity built through the Copy Wizard. But when it is used with a custom activity that tries to check whether a file is present in the Data Lake, the activity fails with the error "Authorization is required". When an Azure Blob Store is used for the input and output datasets, the same custom activity works fine.
It seems like an issue with the Azure Batch compute node not being able to authorize against Data Lake Store. Please help if you have solved this problem.
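For context, the failing file-presence check inside the custom activity looks roughly like the sketch below, using the azure-datalake-store SDK with the same service principal as the linked service. The store name, path, and credentials are placeholders.

```python
# Sketch: check whether an input file exists in ADLS using service-to-service
# (service principal) auth, as the custom activity attempts to do.
from azure.datalake.store import core, lib

token = lib.auth(
    tenant_id="<tenant-id>",
    client_id="<sp-client-id>",
    client_secret="<sp-client-secret>",
)
adls = core.AzureDLFileSystem(token, store_name="<data-lake-store-name>")

# The custom activity only needs to know whether the input file is present.
print(adls.exists("/input/2017/04/data.csv"))
```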
I had this exact same issue about 3 weeks ago. I feel your pain!
This is a Microsoft bug!
After much trial and error and several redeployments, I raised a support ticket with Microsoft, who confirmed that service principal authentication for Data Lake Store currently only works with copy activities, not with custom activities.
This is the official response I got on Monday 10th April.
The issue happen because of a bug that custom activity’s connector
schema doesn’t match the latest published connector schema. Actually,
we notice the issue on custom activity and have plan to fix & deploy
to prod in next 2 weeks.
Be aware that if you change your linked service back to using a session token etc., you'll also need to redeploy the pipelines that contain the custom activities. Otherwise you'll get another error, something like the following:
Access is forbidden, please check credentials and try again. Code:
'AuthenticationFailed' Message: 'Server failed to authenticate the
request. Make sure the value of Authorization header is formed
correctly including the signature.
Hope this helps.

Google storage public file security - access without a link?

I need to use Google Cloud Storage to store some files that can contain sensitive information. File names will be generated using a crypto function and are thus unguessable. The files will be made public.
Is it safe to assume that the file list will not be available to the public? I.e. a file can only be accessed by someone who knows the file name.
I have of course tried accessing the parent directory and the bucket, and I do get rejected with an unauthenticated error. I am wondering if there is, or will ever be, any other way to list the files.
Yes, that is a valid approach to security through obscurity. As long as the ACL to list the objects in a bucket is locked down, your object names should be unguessable.
However, you might consider using Signed URLs instead. They can have an expiration time set so it provides extra security in case your URLs are leaked.
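A quick sketch of that signed-URL alternative with the google-cloud-storage Python client; the bucket and object names are placeholders, and the client must be running with service account credentials that can sign:

```python
# Sketch: hand out a time-limited V4 signed URL instead of relying on an
# unguessable public object name.
from datetime import timedelta
from google.cloud import storage

client = storage.Client()  # service account credentials with a private key
blob = client.bucket("my-private-bucket").blob("a3f9c1d2e4/photo.png")

url = blob.generate_signed_url(
    version="v4",
    expiration=timedelta(minutes=15),  # the link stops working after this
    method="GET",
)
print(url)  # share this URL; the bucket and object themselves stay private
```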
Yes, but keep in mind that the ability to list objects in a bucket is allowed for anyone with read permission or better on the bucket itself. If your object names are secret, make sure to keep the bucket's read permissions locked down as much as possible.
jterrace's suggestion about preferring signed URLs is a good one. The major downside to obscure object names is that it's very difficult to remove access to a particular entity later without deleting the resource entirely.