Some questions about Google Data Fusion - google-cloud-data-fusion

I am discovering the tool and I have some questions:
- What exactly do you mean by the type File in (Source, Sink)?
- Is it also possible to send the result of the pipeline directly to an FTP server?
I checked the documentation, but I did not find this information.
Thank you.

Short answer: File refers to the filesystem where the pipelines run. In the Data Fusion context, if you use a File sink, the contents are written to HDFS on the Dataproc cluster.
Data Fusion has an SFTP Put action that can be used to write to SFTP. Here is a simple pipeline showing how to write to SFTP from GCS.
Step 1: GCS Source to File Sink - this writes the contents of GCS to HDFS on the Dataproc cluster when the pipeline runs.
Step 2: SFTP Put action, which takes the output of the File sink and uploads it to SFTP.
You need to configure the output path of the File sink to be the same as the source path in the SFTP Put action.
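Inside Data Fusion those two steps are plain plugin configuration, but to make the handoff concrete, here is a rough Python sketch of the equivalent GCS-to-SFTP copy done by hand; the bucket name, object path, SFTP host and credentials are placeholders.

# Rough sketch (not a Data Fusion plugin config): download the pipeline output
# from GCS to a local staging file, then push it to the SFTP server.
# Bucket, paths, host and credentials below are placeholders.
import paramiko
from google.cloud import storage

# Download the pipeline output from GCS
client = storage.Client()
blob = client.bucket("my-output-bucket").blob("exports/result.csv")
blob.download_to_filename("/tmp/result.csv")

# Upload the staged file to the SFTP server
transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="sftp-user", password="sftp-password")
sftp = paramiko.SFTPClient.from_transport(transport)
sftp.put("/tmp/result.csv", "/upload/result.csv")
sftp.close()
transport.close()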

Related

Retrieve Large XML file from HTTP and copy to Azure SQL

I'm currently using an API call to download an XML file which is 600 MB and copying it to Azure SQL using the copy activity. Because the file is large, no run has ever succeeded - the debug run keeps running and never completes.
I have tried splitting the source file using Data Flow, Lookup, and Get Metadata; however, HTTP request support isn't available for these activities. Is there a way to copy a large XML file from an HTTP request to Azure SQL?

Executing Batch service in Azure Data factory using python script

Hi, I've been trying to execute a custom activity in ADF which receives a CSV file from container (A); after transformation of the dataset, the transformed DataFrame is stored as another CSV file in the same container (A).
I've written the transformation logic in Python and have it stored in the same container (A).
The error arises here: when I execute the pipeline, it returns the error *can't find the specified file*.
Nothing is wrong with the connections. Is anything wrong with the Batch account or pools?
Can anyone tell me where to place the Python script?
Install Azure Batch Explorer and make sure to choose the proper configuration for the virtual machine (dsvm-windows), which will ensure Python is already in place in the virtual machine where your code runs.
This video explains the steps
https://youtu.be/_3_eiHX3RKE
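For reference, here is a minimal sketch of the kind of transformation script the question describes - read a CSV from container (A), transform it, and write the result back to the same container. The connection string, container name, blob names and the stand-in transformation are all placeholders, not the asker's actual code.

# Minimal sketch of the custom-activity script described above. The connection
# string, container name and blob names are placeholders.
import io
import pandas as pd
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")

# Read the input CSV from container (A)
source = service.get_blob_client(container="container-a", blob="input.csv")
df = pd.read_csv(io.BytesIO(source.download_blob().readall()))

# Stand-in transformation: drop rows with missing values
df = df.dropna()

# Write the transformed DataFrame back to the same container as a new CSV
target = service.get_blob_client(container="container-a", blob="transformed.csv")
target.upload_blob(df.to_csv(index=False), overwrite=True)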

Pull from and Push to S3 using Perl

Hi everyone! I have what I assume to be a simple problem, but I could use a hand digging in. I have a server that preprocesses data before translation. This is done by a series of Perl scripts developed over a decade ago (but they work!). This virtual server is being lifted into AWS. The change this makes for my scripts is that the locations they pull from and write to will now be S3 buckets.
The workflow is: copy all files in the source location to the local drive, preprocess the data file by file, and when complete move the preprocessed files to a final destination.
process_file($workingDir, $dirEntry);    # preprocess the file in the working directory
final_move;                              # move the preprocessed output to its final destination
# archive the original download, then remove the working copy
move("$downloadDir/$dirEntry", "$archiveDir") or die "ERROR: Archive file $downloadDir/$dirEntry -> $archiveDir FAILED $!\n";
unlink("$workingDir/$dirEntry");
So, in this case $downloadDir and $archiveDir are S3 buckets.
Any advice on adapting this is appreciated.
TIA,
VtR
You have a few options.
Use a system like s3fs-fuse to mount your S3 bucket as a local drive. This would presumably require the smallest changes to your existing code.
Use the AWS Command Line Interface to copy your files to your S3 bucket.
Use the Amazon API (through something like Paws) to upload your files to S3.
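To make the SDK option concrete, here is a rough sketch of the pull/preprocess/push workflow described in the question, using an AWS SDK - shown with boto3 in Python rather than Paws, and with placeholder bucket names, paths and a stand-in preprocessing step.

# Rough sketch of the pull / preprocess / push workflow with an AWS SDK (boto3).
# Bucket names, the working directory and preprocess_file() are placeholders.
import os
import boto3

def preprocess_file(path):
    pass    # stand-in for the existing preprocessing logic

s3 = boto3.client("s3")
download_bucket, archive_bucket = "my-download-bucket", "my-archive-bucket"
working_dir = "/tmp/working"
os.makedirs(working_dir, exist_ok=True)

# Copy every object from the source bucket to the local working directory
for obj in s3.list_objects_v2(Bucket=download_bucket).get("Contents", []):
    key = obj["Key"]
    local_path = os.path.join(working_dir, os.path.basename(key))
    s3.download_file(download_bucket, key, local_path)

    preprocess_file(local_path)

    # Move the preprocessed file to the archive bucket, then clean up locally
    s3.upload_file(local_path, archive_bucket, key)
    s3.delete_object(Bucket=download_bucket, Key=key)
    os.unlink(local_path)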

How to download file from url and store it in aws s3 bucket?

As stated, I'm trying to download this dataset of ZIP folders containing images: https://data.broadinstitute.org/bbbc/BBBC006/ and store them in an S3 bucket so I can later unzip them in the bucket, reorganize them, and pull them in smaller chunks into a VM for some computation. The problem is, I don't know how to get the data from https://data.broadinstitute.org/bbbc/BBBC006/BBBC006_v1_images_z_00.zip, for example, or any of the other ones, and then send it to S3.
This is my first time using AWS or really any cloud platform, so please bear with me :]
Amazon EC2 provides a virtual computer just like a normal Linux or Windows computer.
Amazon S3 is an object storage service where you can upload/download files.
If you wish to copy files from a website to Amazon S3, you will need to write an application or script that will:
Download the files from the website
Upload them to Amazon S3
If you wish to do it from a script, you could use the AWS Command-Line Interface (CLI).
Or, you could do it from a programming language, see: SDKs and Programming Toolkits for AWS
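As a rough sketch of the script approach using an AWS SDK (Python/boto3 here), the following downloads one of the ZIP files linked in the question and uploads it to S3; the bucket name and key are placeholders, and the bucket must already exist with AWS credentials configured.

# Rough sketch: download one ZIP file from the question's URL and upload it to S3.
# The bucket name and key are placeholders.
import boto3
import requests

url = "https://data.broadinstitute.org/bbbc/BBBC006/BBBC006_v1_images_z_00.zip"
bucket = "my-image-bucket"
key = "BBBC006/BBBC006_v1_images_z_00.zip"

# Stream the download to a local file so the whole ZIP is not held in memory
with requests.get(url, stream=True) as response:
    response.raise_for_status()
    with open("/tmp/BBBC006_v1_images_z_00.zip", "wb") as f:
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)

# Upload the downloaded file to the S3 bucket
boto3.client("s3").upload_file("/tmp/BBBC006_v1_images_z_00.zip", bucket, key)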

How we can copy any file within Azure Data Lake Store folders

We already have Move-AzureRmDataLakeStoreItem, which will move files between folders inside Azure Data Lake. What I am seeking is to copy files within the Data Lake without affecting the original file.
The possibilities that I know are-
using U-SQL to EXTRACT data from the source file and then OUTPUT to the destination file - but I am trying to copy all sorts of files (.gz, .txt, .info, .exe, .msi) and I am not sure if U-SQL can help me with .gz or .exe or .msi files
using Data Factory to copy data from/to Data Lake store
So, my question here is: do we have anything else at our disposal with which we can copy files within Azure Data Lake Store?
You have a couple of other options:
Run DistCp on an HDInsight cluster - similar to the instructions provided here: https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-copy-data-wasb-distcp
Use AdlCopy if you are copying a limited amount of data (say, tens to hundreds of GB) - https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-copy-data-azure-storage-blob
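As a rough illustration of the DistCp option, something like the following could be run from an HDInsight cluster head node; the account name and paths are placeholders, and DistCp copies the data while leaving the original file in place.

# Rough sketch: invoke DistCp from an HDInsight cluster head node to copy a file
# between two folders in the same Data Lake Store account.
# The account name and paths are placeholders.
import subprocess

account = "myadlsaccount"
source = f"adl://{account}.azuredatalakestore.net/raw/archive.gz"
destination = f"adl://{account}.azuredatalakestore.net/copies/"

# DistCp runs as a distributed copy job and leaves the source file untouched
subprocess.run(["hadoop", "distcp", source, destination], check=True)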
Does this suffice? Or do you want something natively supported by Azure Data Lake Store via its REST APIs?
Thanks,
Sachin Sheth
Program Manager, Azure Data Lake.