ADF doesn't create BlobCreation event with dataflow - azure-data-factory

I have a pipeline with a dataflow activity inside it that copies data to blob storage. I have a trigger activated.
The problem is that the trigger works if I place the file in storage manually, but it doesn't fire when the dataflow puts the file on the blob storage with the copy activity.
Here is the trigger info:

The problem is that a sink in dataflow when using parquet format generates a BlobRenamed event instead of BlobCreation. Therefore, the trigger doesn't get the right event.

I tried it and it works fine for me: the trigger detects a blob being added to the specified container.
It appears that the trigger is set up to react only when new files are added to the blob storage and not when existing files change. It's possible that the dataflow activity updates an existing file rather than producing a new one when it moves data to the blob storage.
I agree with @Joel Cochran: in Blob path ends with you are passing train_data.parquet, so if the trigger does not find a file whose name matches that pattern, it will not trigger the pipeline.
You can adjust the trigger to look for changes in both new and existing files to fix this. One way is to put only .parquet in the Blob path ends with field of the trigger setup, which makes the trigger react to any matching file in the supplied path.
Specify the correct details.
It's possible that the trigger is not configured to detect changes made by the dataflow activity. Check the trigger's settings to ensure that it is monitoring the correct blob container and that it is set up to detect the appropriate types of changes, such as new or modified blobs.
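As a rough illustration of the Blob path ends with advice above, here is a minimal Python sketch of how the path filter behaves. The field names follow the storage event trigger definition as I understand it; the container and folder names are placeholders, and `would_match` is a hypothetical helper that only approximates the filter. It says nothing about which event type the dataflow sink emits.

```python
# Sketch of a storage event trigger's filter settings (placeholder values).
trigger_type_properties = {
    "blobPathBeginsWith": "/my-container/blobs/output/",  # placeholder path
    "blobPathEndsWith": ".parquet",                        # suffix only, no file name
    "ignoreEmptyBlobs": True,
    "events": ["Microsoft.Storage.BlobCreated"],
}


def would_match(blob_path: str, props: dict) -> bool:
    """Approximate the begins-with / ends-with filter, for illustration only."""
    return (
        blob_path.startswith(props["blobPathBeginsWith"])
        and blob_path.endswith(props["blobPathEndsWith"])
    )


# A generated part file passes the suffix-only filter...
generated = "/my-container/blobs/output/part-0000.snappy.parquet"
print(would_match(generated, trigger_type_properties))

# ...but not a filter that expects one exact file name.
strict = dict(trigger_type_properties, blobPathEndsWith="train_data.parquet")
print(would_match(generated, strict))
```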

I ended up using a web activity to send custom blob events to a custom events topic and using a custom events trigger on the receiving pipeline.
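A minimal sketch of what the body of such a web activity could look like, assuming the custom topic accepts the Event Grid event schema (a JSON array of events with `id`, `subject`, `eventType`, `eventTime`, `data` and `dataVersion`). The event type `Custom.BlobWritten`, the subject convention and the `data` fields are hypothetical, not taken from the original post.

```python
import json
import uuid
from datetime import datetime, timezone


def build_blob_event(container: str, blob_name: str) -> list:
    """Build a one-element event list in the Event Grid event schema."""
    event = {
        "id": str(uuid.uuid4()),
        "eventType": "Custom.BlobWritten",  # hypothetical event type
        "subject": f"/containers/{container}/blobs/{blob_name}",  # hypothetical convention
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "data": {"container": container, "blobName": blob_name},
        "dataVersion": "1.0",
    }
    return [event]


# JSON string that the web activity would POST to the topic endpoint.
body = json.dumps(build_blob_event("output", "train_data.parquet"))
print(body)
```

The receiving trigger would then filter on the same event type and subject that the sender uses.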

Related

How to set up blob storage Property before web activity drop a file into Blob storage?

I have created a pipeline in Azure Data Factory with one web activity that copies an Excel (xlsx) file from the Dropbox App Console and another web activity that copies the file into Blob Storage. The pipeline executes successfully and the file arrives in Blob Storage in the same xlsx format, but when I open the Excel file from Blob Storage I get the error "The file myfilename.xlsx may not render correctly as it contains an unrecognized extension".
When the web activity copies the file, I see it has Content-Type = application/octet-stream. I tried changing it to Content-Type = application/vnd.openxmlformats-officedocument.spreadsheetml.sheet. Any help setting the blob storage property before I copy my Excel file from the web activity output would be appreciated.
The web activity's associated service and dataset characteristics are to be applied. In the interim, I'll give you a solution that I know will work. Use two web activities. The first retrieves the data; its output is then used as the body of the second web activity. In the example below, I substitute another blob for the website. (It essentially duplicates the blob.)
Here, the URL looks like this to obtain the blob:
https://.blob.core.windows.net///;
I also use MSI authentication. For this to work, you must grant the Data Factory permissions in the storage account (or container) access control.
The header x-ms-version must then be added, along with the value 2017-11-09. (A picture of an earlier solution illustrates this.)
You should use https://storage.azure.com as the resource.
If an error message like this appears:
AuthorizationPermissionMismatch
This operation cannot be carried out under the terms of this request.
then you might need to visit your storage account and grant additional permissions, as seen below. Wait a few minutes for the permissions to propagate; otherwise, you risk getting a false negative.
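The headers for the second web activity can be sketched as follows. This is a minimal sketch, assuming the write goes through the Blob Storage Put Blob REST operation, where `x-ms-blob-type` is required and `x-ms-blob-content-type` sets the stored content type. The helper `put_blob_headers` is hypothetical, and I have not verified how the web activity handles a binary xlsx body.

```python
XLSX_CONTENT_TYPE = (
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
)


def put_blob_headers(content_type: str, version: str = "2017-11-09") -> dict:
    """Return request headers for a Put Blob call that sets the content type."""
    return {
        "x-ms-version": version,                  # version mentioned in the answer
        "x-ms-blob-type": "BlockBlob",            # required by Put Blob
        "x-ms-blob-content-type": content_type,   # stored as the blob's Content-Type
    }


headers = put_blob_headers(XLSX_CONTENT_TYPE)
for name, value in headers.items():
    print(f"{name}: {value}")
```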

Azure Logic App blob added or modified trigger not working correctly when app is cloned

I have an Azure Blob Storage account with 3 different containers, let's call them container-a, container-b and container-c, to which data is uploaded more or less frequently as txt files.
I then created a logic app with the When a blob is added or modified trigger and connected it to container-a. It worked like a charm.
So I cloned the Logic App and connected the clones to the corresponding containers, container-b and container-c, but in both clones the trigger fires for blobs that were added to container-a.
I checked all the Triggers settings, but everything looks quite okay to me.
FYI:
I edited the question, since it only seemed to occur with my cloned Logic Apps using that trigger.
I have to check whether I can reproduce the issue.
I found the error, and it is definitely a bug:
When you select a new container for an existing Blob Storage connection, the container is saved, but in the background it is stored incorrectly because it appears twice (you can check this in the Logic App code view under definition -> triggers -> metadata).
This apparently results in the trigger simply using the first container (in alphanumeric order) of the connected Blob Storage.
Deleting the duplicate entry in the code view helped in one case, but not in the other.
In the end, the best thing to do is to completely reconnect the Blob Storage in the trigger step of the Logic Apps Designer and then select the desired container there again.

Is there a way to change the google storage signed url to not include the name of the file?

I have a method that gets a signed url for a blob in a google bucket and then returns that to users. Ideally, I could change the name of the file shown as well. Is this possible?
An example is:
https://storage.googleapis.com/<bucket>/<path to file.mp4>?Expires=1580050133&GoogleAccessId=<access-id>&Signature=<signature>
The part that I'd like to set myself is <path to file.mp4>.
The only way I can think of is having something in the middle that will be responsible for the name "swap".
For example Google App Engine with an http trigger or Cloud Function with storage trigger that whenever you need it will fetch the object, rename it, and either provide it to the user directly or store it with the new name in another bucket.
Keep in mind that things you want to store temporarily in GAE or Cloud Functions need to be stored in "/tmp" directory.
Then for renaming, if you are using GAE possibly you can use something like:
import os
os.system([YOUR_SHELL_COMMAND])
However, the easiest but more costly approach is to set a Function with storage trigger that whenever an object is uploaded it will store a copy of it with the desired new name in a different bucket that you will use for the users.
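The rename step in the middle layer can also be done with the standard library instead of a shell command. This is a minimal sketch that assumes the object has already been downloaded to temporary storage; `rename_in_tmp` and both file names are hypothetical, and the download and re-upload steps are left out.

```python
import os
import tempfile


def rename_in_tmp(src_path: str, new_name: str) -> str:
    """Rename a file within its current directory and return the new path."""
    dst_path = os.path.join(os.path.dirname(src_path), new_name)
    os.replace(src_path, dst_path)  # overwrites dst_path if it already exists
    return dst_path


# Demo with a stand-in file in a temporary directory.
tmp_dir = tempfile.mkdtemp()
original = os.path.join(tmp_dir, "path-to-file.mp4")
with open(original, "wb") as fh:
    fh.write(b"example bytes")

renamed = rename_in_tmp(original, "public-name.mp4")
print(renamed)
```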

Setting up a BigQuery to Google Cloud Storage pipeline with overwriting

I am trying to set up a really simple pipeline in Data Fusion which takes a table from BigQuery and stores that data in Google Cloud Storage. With the pipeline setup below it's fairly easy. We first read the BigQuery table and schema, then sink the data into a Google Cloud Storage bucket. This works, but the problem is that a new folder and a new file get created for each new transfer that I run. What I would like to do is overwrite a single file in the same file path with each new transfer.
What I ran into is that in this setup a new folder and a new file get created within Google Cloud Storage using a timestamp prefix. Looking at the sink configuration below, you can indeed see a timestamp by default.
That would mean that if I removed the prefix, a new folder shouldn't be created. The hover-over confirms this: "If not specified, nothing will be appended to the path".
However, when I clear this value and save it, the full time format automatically pops up again. I can't use a static value because this results in errors. For example, I tried creating a folder named "12" in Google Cloud Storage and then setting the prefix to that, but as you would guess this doesn't work. Is anyone else running into this problem? How do I get rid of the path suffix so I don't get a new folder for each timestamp within Google Cloud Storage?
This seems to be an issue with Data Fusion UI. Have filed a JIRA for this https://issues.cask.co/browse/CDAP-16129.
I understand this can be confusing when you open the configuration again. The reason this happens is that whenever you open the configuration modal, we pre-populate fields with default values from the plugin widget JSON (if no value is present).
As a workaround, can you try one of the following:
Export pipeline - Once you have configured all the properties in the plugins you can export the pipeline. This step should download a JSON for you where you can locate the property and remove it and import the pipeline and publish without opening the specific plugin.
Or, simply remove the property from the plugin configuration modal and close and publish the pipeline directly. UI will Re-populate the value every time you open the plugin configuration. Once you delete and close the modal it should retain that state until you open the configuration again.
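The export-and-edit workaround could be scripted roughly like this. This is a minimal sketch, assuming the exported JSON keeps plugin settings under `config` -> `stages` -> `plugin` -> `properties` and that the timestamp setting is stored under a property named `suffix`. Both are assumptions to check against your own export, and `strip_property` is a hypothetical helper.

```python
import copy


def strip_property(pipeline: dict, stage_name: str, prop: str) -> dict:
    """Return a copy of the pipeline with `prop` removed from one stage."""
    cleaned = copy.deepcopy(pipeline)
    for stage in cleaned.get("config", {}).get("stages", []):
        if stage.get("name") == stage_name:
            stage.get("plugin", {}).get("properties", {}).pop(prop, None)
    return cleaned


# Stand-in for the exported pipeline JSON (structure assumed, values made up).
exported = {
    "config": {
        "stages": [
            {
                "name": "GCS",
                "plugin": {
                    "name": "GCS",
                    "properties": {
                        "path": "gs://my-bucket/out",
                        "suffix": "yyyy-MM-dd-HH-mm",
                    },
                },
            }
        ]
    }
}

cleaned = strip_property(exported, "GCS", "suffix")
print(cleaned["config"]["stages"][0]["plugin"]["properties"])
```

In practice you would load the exported file with `json.load`, apply the helper, write the result with `json.dump`, and import that file without opening the plugin configuration.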
Hope this helps.

AWS CloudFormation: Detect Drift

While updating an S3 bucket name through CloudFormation, the stack automatically goes to UPDATE_ROLL_BACK. Is it possible to update an S3 bucket name through CloudFormation, and how does drift detection work?
Updating bucket name requires a replacement. That means that CloudFormation will delete the bucket and then create a new one with the new name. CloudFormation won't delete buckets unless they're empty. That's probably why it fails for your case. To confirm this go to the CloudFormation console page, click on the stack and go to the events tab. Look at some of the latest events and one of them should be about failing to delete the bucket.
To get around this you need to empty your bucket before updating your stack. You probably want to back up all of its content, update the stack, and upload the content back to the new bucket.
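Emptying the bucket before the update could be sketched as below. This is a minimal sketch, assuming `client` is a boto3 S3 client (or anything exposing the same `list_objects_v2` and `delete_objects` calls). `empty_bucket` is a hypothetical helper, and it only removes current objects, so a versioned bucket would also need its object versions deleted. Back up the content first, as described above.

```python
def empty_bucket(client, bucket: str) -> int:
    """Delete every current object in `bucket`; return how many keys were deleted."""
    deleted = 0
    kwargs = {"Bucket": bucket}
    while True:
        page = client.list_objects_v2(**kwargs)
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            # Each listing page holds at most 1000 keys, which is also
            # the per-call limit of delete_objects.
            client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
        if not page.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
    return deleted
```

With boto3 installed, usage would look like `empty_bucket(boto3.client("s3"), "my-old-bucket")`, where the bucket name is a placeholder.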