Setting up a BigQuery to Google Cloud Storage pipeline with overwriting - google-cloud-data-fusion

I am trying to set up a really simple pipeline in Data Fusion which takes a table from BigQuery and stores that data in Google Cloud Storage. With the pipeline setup below it's fairly easy: we first read the BigQuery table and schema, then sink the data into a Google Cloud Storage bucket. This works, but the problem is that a new folder and a new file get created for each transfer that I run. What I would like instead is to overwrite a single file at the same file path with each new transfer.
What I ran into is that with this setup, a new folder and a new file get created within Google Cloud Storage using a timestamp path suffix. Looking at the sink configuration below, there is indeed a timestamp suffix set by default.
Alright, that would mean that if I remove the suffix, a new folder shouldn't be created. The hover-over text confirms this: "If not specified, nothing will be appended to the path".
However, when I clear this value and save, the full time format automatically pops up again. I also can't use a static value, because that results in errors. For example, I just tried creating a folder named "12" in Google Cloud Storage and setting the suffix to that, but as you might guess this doesn't work. Is anyone else running into this problem? How do I get rid of the path suffix so that a new folder isn't created for every timestamp within Google Cloud Storage?

This seems to be an issue with the Data Fusion UI. I have filed a JIRA for it: https://issues.cask.co/browse/CDAP-16129.
I understand this can be confusing when you open the configuration again. The reason this happens is that whenever you open the configuration modal, we pre-populate fields with default values from the plugin widget JSON (if no value is present).
As a workaround, you can try one of the following:
Export the pipeline - Once you have configured all the properties in the plugins, export the pipeline. This downloads a JSON file in which you can locate the property and remove it (see the sketch after these options); then import the pipeline and publish it without opening that specific plugin again.
Or, simply remove the property from the plugin configuration modal, close it, and publish the pipeline directly. The UI will re-populate the value every time you open the plugin configuration, but once you delete the value and close the modal, that state is retained until you open the configuration again.
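For the export route, here is a minimal sketch of stripping the path suffix from the exported pipeline JSON before re-importing it. It assumes the GCS sink stage is named "GCS" and that the property key is "suffix"; check your exported file for the actual names.

```python
import json

# Load the pipeline JSON exported from Data Fusion.
with open("pipeline.json") as f:
    pipeline = json.load(f)

# Remove the path suffix property from the GCS sink stage.
# Assumption: the stage is named "GCS" and the property key is "suffix".
for stage in pipeline["config"]["stages"]:
    if stage["name"] == "GCS":
        stage["plugin"]["properties"].pop("suffix", None)

# Write the cleaned pipeline back out, ready to import and publish.
with open("pipeline-no-suffix.json", "w") as f:
    json.dump(pipeline, f, indent=2)
```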
Hope this helps.

Related

ADF doesn't create a BlobCreation event with dataflow

I have a pipeline with a dataflow activity inside it that copies data to blob storage, and I have a trigger activated.
The problem is that the trigger works if I place the file manually in storage, but it doesn't fire when the dataflow puts the file in blob storage via the copy activity.
Here is the trigger info:
The problem is that a dataflow sink using the parquet format generates a BlobRenamed event instead of a BlobCreated event, so the trigger doesn't receive the right event.
I tried this and it's working fine for me; it detects that a blob is being added to the particular container.
It appears that the trigger is set up to only react when new files are added to the blob storage and not when old files change. It's possible that the dataflow activity updates an existing file rather than producing a new one when it moves data to the blob storage.
Agreed with Joel Cochran: in "Blob path ends with" you are passing train_data.parquet; if the trigger does not find a file whose name matches that pattern, it will not trigger the pipeline.
You can fix this by tweaking the trigger to look for changes to both new and existing files. This can be achieved by putting only .parquet in the "Blob path ends with" section of the trigger setup, which will make the trigger react to any matching files under the supplied path.
Specify the correct details in the trigger configuration.
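For illustration, a storage event trigger definition along these lines might look roughly like the following sketch; the scope, container path, and pipeline name are placeholders rather than values from the question.

```python
import json

# Sketch of an ADF storage event trigger that fires on any .parquet blob
# written under a given folder. All resource names below are placeholders.
trigger = {
    "name": "ParquetCreatedTrigger",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/my-container/blobs/output/",
            "blobPathEndsWith": ".parquet",
            "ignoreEmptyBlobs": True,
            "events": ["Microsoft.Storage.BlobCreated"],
            "scope": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>",
        },
        "pipelines": [
            {"pipelineReference": {"referenceName": "MyPipeline", "type": "PipelineReference"}}
        ],
    },
}

print(json.dumps(trigger, indent=2))
```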
It's possible that the trigger is not configured to detect changes made by the dataflow activity. Check the trigger's settings to ensure that it is monitoring the correct blob container and that it is set up to detect the appropriate types of changes, such as new or modified blobs.
I ended up using a Web activity to send custom blob events as custom events, and using custom event triggers on the receiving pipeline.
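If you go this route, the Web activity essentially posts an event to an Event Grid custom topic. A rough Python equivalent of that call, with the topic endpoint, key, and event fields as placeholders, might be:

```python
import datetime
import uuid

import requests

# Placeholders: your Event Grid custom topic endpoint and access key.
TOPIC_ENDPOINT = "https://<your-topic>.<region>-1.eventgrid.azure.net/api/events"
TOPIC_KEY = "<topic-access-key>"

# A single custom event in the Event Grid event schema, announcing that
# the dataflow has finished writing a blob.
event = {
    "id": str(uuid.uuid4()),
    "subject": "/dataflow/output/train_data.parquet",
    "eventType": "Custom.BlobWritten",
    "eventTime": datetime.datetime.utcnow().isoformat() + "Z",
    "data": {"path": "output/train_data.parquet"},
    "dataVersion": "1.0",
}

resp = requests.post(TOPIC_ENDPOINT, json=[event], headers={"aeg-sas-key": TOPIC_KEY})
resp.raise_for_status()
```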

How to set a blob storage property before a web activity drops a file into Blob Storage?

I have created a pipeline in Azure Data Factory in which I have a web activity that copies an Excel (xlsx) file from the Dropbox App Console, and another web activity that copies the file into Blob Storage. The pipeline executes successfully and copies the file in the same xlsx format into Blob Storage, but when I open the Excel file from blob storage I get the error "The file myfilename.xlsx may not render correctly as it contains an unrecognized extension".
When the web activity copies the file, I see it has Content-Type = application/octet-stream. I tried to change the Content-Type to application/vnd.openxmlformats-officedocument.spreadsheetml.sheet. Any help would be appreciated on how to set the blob storage property before I copy my Excel file from the web activity output.
Ideally the linked service and dataset properties associated with the web activity would be applied here. In the interim, I'll give you a solution that I know works: use two web activities. The first retrieves the data from the blob, and its output is used as the body of the second web activity. In the example below I substitute another blob for the website (it essentially duplicates the blob).
Here, the URL used to fetch the blob looks like this:
https://<storage account>.blob.core.windows.net/<container>/<blob>
I also use MSI authentication. For this to work, you must grant the Data Factory rights in the storage account's (or container's) access control.
Then add the header x-ms-version with the value 2017-11-09.
You should use https://storage.azure.com as the resource.
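Put together, the second call amounts to a Put Blob request against the Blob storage REST API. A rough Python equivalent is sketched below; the storage account, container, blob name, and the MSI token acquisition are placeholders or simplifications.

```python
import requests

# Placeholders for your storage account, container, and target blob.
ACCOUNT = "<storage-account>"
CONTAINER = "<container>"
BLOB = "myfilename.xlsx"

# In ADF the token comes from MSI authentication against the resource
# https://storage.azure.com; here we assume it is already available.
token = "<access-token-from-msi>"

with open("myfilename.xlsx", "rb") as f:
    body = f.read()

resp = requests.put(
    f"https://{ACCOUNT}.blob.core.windows.net/{CONTAINER}/{BLOB}",
    data=body,
    headers={
        "Authorization": f"Bearer {token}",
        "x-ms-version": "2017-11-09",
        "x-ms-blob-type": "BlockBlob",
        # Sets the Content-Type stored on the blob, so the xlsx opens correctly.
        "x-ms-blob-content-type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    },
)
resp.raise_for_status()
```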
If an error like this appears:
AuthorizationPermissionMismatch
This request is not authorized to perform this operation using this permission.
then you may need to go to your storage account and grant additional permissions in its access control settings. Wait a few minutes for the permissions to propagate; otherwise you risk a false negative.

How do I resolve a "The policy has been modified by another process" error in Google Cloud SQL?

I'm trying to export the contents of a MySQL table from Google Cloud SQL into a Cloud Storage bucket, and I'm running into an error:
The policy has been modified by another process. Please try again.
Yesterday, I happily imported CSV data to my Cloud SQL database left and right, and when I tried to write some of the modified data from a query out to another CSV file, I got tripped up. So I followed the directions here to try to resolve my issue: https://cloud.google.com/sql/docs/mysql/import-export/exporting?_ga=2.11596404.-1979747439.1580744022
I hit a wall at the end of the day and decided to come back to it later. This morning, I created a new table in my DB and inserted the data into it that I need to export via a query. When I went to export it using the export function in the Cloud SQL console, I get the error message above.
I'm pretty sure I messed something up with permissions somewhere when I was poking around yesterday, but I can't figure out what I did. I'm also having problems with import now, too -- I get a little "Permissions updated" popup, and then this error:
Operation failed: You are not authorized to perform this operation. Learn more
I'd appreciate any insight into how to undo whatever I apparently did.
Never mind! I opened up another project/storage instance and made sure things matched. For future reference, the permissions for the service account on the storage bucket need to be:
Storage Legacy Bucket Reader
Storage Object Admin
Storage Object Viewer
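If you prefer to grant these from code instead of the console, something along these lines should work; the bucket name and service account address are placeholders, and the roles correspond to roles/storage.legacyBucketReader, roles/storage.objectAdmin, and roles/storage.objectViewer.

```python
from google.cloud import storage

# Placeholders: your export bucket and the Cloud SQL instance's service account.
BUCKET_NAME = "my-export-bucket"
SERVICE_ACCOUNT = "serviceAccount:<cloud-sql-instance-service-account>"

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

# Fetch the bucket's IAM policy and append a binding for each role.
policy = bucket.get_iam_policy(requested_policy_version=3)
for role in (
    "roles/storage.legacyBucketReader",
    "roles/storage.objectAdmin",
    "roles/storage.objectViewer",
):
    policy.bindings.append({"role": role, "members": {SERVICE_ACCOUNT}})

bucket.set_iam_policy(policy)
```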

Is there a way to change the google storage signed url to not include the name of the file?

I have a method that gets a signed url for a blob in a google bucket and then returns that to users. Ideally, I could change the name of the file shown as well. Is this possible?
An example is:
https://storage.googleapis.com/<bucket>/<path to file.mp4>?Expires=1580050133&GoogleAccessId=<access-id>&Signature=<signature>
The part that I'd like to set myself is <path to file.mp4>.
The only way I can think of is having something in the middle that will be responsible for the name "swap".
For example, Google App Engine with an HTTP trigger, or a Cloud Function with a storage trigger, that whenever needed will fetch the object, rename it, and either provide it to the user directly or store it with the new name in another bucket.
Keep in mind that anything you want to store temporarily in GAE or Cloud Functions needs to go in the "/tmp" directory.
Then, for renaming, if you are using GAE you could use something like:
import os
# os.system takes a single shell command string,
# e.g. "mv /tmp/original_name.mp4 /tmp/new_name.mp4".
os.system(YOUR_SHELL_COMMAND)
However, the easiest but more costly approach is to set up a Cloud Function with a storage trigger so that whenever an object is uploaded, it stores a copy of it under the desired new name in a different bucket that you then use for the users.
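A minimal sketch of such a Cloud Function, assuming a background function on the object finalize trigger and a made-up naming rule; the destination bucket and the new-name scheme are placeholders.

```python
from google.cloud import storage

# Placeholder: the bucket whose objects you hand out to users.
DEST_BUCKET = "my-public-copies"

client = storage.Client()


def rename_copy(event, context):
    """Background Cloud Function triggered by google.storage.object.finalize.

    Copies the newly uploaded object into DEST_BUCKET under a new name.
    """
    source_bucket = client.bucket(event["bucket"])
    blob = source_bucket.blob(event["name"])

    # Hypothetical naming rule: keep only the extension, use a fixed display name.
    extension = event["name"].rsplit(".", 1)[-1]
    new_name = f"video.{extension}"

    destination_bucket = client.bucket(DEST_BUCKET)
    source_bucket.copy_blob(blob, destination_bucket, new_name)
```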

How to Connect CloverETL to Google Cloud Storage?

I am using CloverETL Designer for ETL operations and I want to load some CSV files from GCS into my Clover graph. I used FlatFileReader and tried to get the file using a remote File URL, but it is not working. Can someone please detail the entire process here?
The path for the file in GCS is
https://storage.cloud.google.com/PATH/Write_to_a_file.csv
and I need to get this CSV file into the FlatFileReader in CloverETL Designer.
You should use the Google Cloud Storage API to GET the file; Clover's HTTPConnector component will allow you to pass in the appropriate parameters to make a GET request (you will presumably have to do an OAuth2 authentication first to get a token), and send the output to a local destination specified in "Output File URL." Then you can use a FlatFileReader to read from that local file.
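For reference, the request the HTTPConnector needs to make is a GCS JSON API media download. A rough Python equivalent, with the bucket, object path, and token as placeholders, looks like this:

```python
import urllib.parse

import requests

# Placeholders: your bucket, the object path inside it, and an OAuth2 token
# with read access to the object.
BUCKET = "my-bucket"
OBJECT_NAME = "PATH/Write_to_a_file.csv"
TOKEN = "<oauth2-access-token>"

# alt=media asks the JSON API for the object's contents rather than its metadata.
url = (
    f"https://storage.googleapis.com/storage/v1/b/{BUCKET}/o/"
    f"{urllib.parse.quote(OBJECT_NAME, safe='')}?alt=media"
)

resp = requests.get(url, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()

# Save locally so a FlatFileReader can pick it up.
with open("Write_to_a_file.csv", "wb") as f:
    f.write(resp.content)
```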
GCS has several different ways to download files from your buckets. You can use the console and the Cloud Storage browser: open the storage browser, navigate to the object you want to download, right-click, and save it to your chosen local folder. If you use Chrome, the save option appears as "Save Link As…".
To use gsutil, run this command:
`gsutil cp gs://[BucketName]/[ObjectName] [ObjectDestination]`.
Or you can use the client libraries or the REST APIs to download files. With these last options you can work with a number of files or create a job to download them. Once the files are in a location known to CloverETL, the process is straightforward.
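As a quick illustration of the client-library route, a short sketch with the Python client (bucket, object, and local destination are placeholders):

```python
from google.cloud import storage

# Placeholders: the bucket, the object to fetch, and where to put it locally.
client = storage.Client()
blob = client.bucket("my-bucket").blob("PATH/Write_to_a_file.csv")
blob.download_to_filename("/data/clover/Write_to_a_file.csv")
```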
Within CloverETL Designer, in the navigation pane you can right-click a folder and choose Import, then pick the location where you placed your GCS file. Once the file is imported, you can use its data like any other data file in Clover. Since this is a .csv file, remember to edit your metadata (right-click the component, choose Extract Metadata, then edit data types, labels, and so on inside the Metadata Editor). Assign metadata to the edges of your components so they know what is coming in and going out of each step. Depending on your file, this process may be repeated many times.
Even with an ETL tool, getting the data and data types correct can be tricky. If you have questions about how to configure data types or edges in an ETL project, a wiki may help, and the web has additional resources that may help you get the end analysis you're looking for.