Move only the files that were read by a Google Cloud Data Fusion pipeline - google-cloud-data-fusion

I have a pipeline, with executions limited to a 30-minute window, that reads from a GCS bucket and writes to BigQuery. After each run I want to move only the files that were actually processed by that run. However, under Conditions and Actions only GCS Move is available, and the difficulty is that it cannot discriminate between files in the source bucket: it moves the entire contents, which causes data loss when a new execution starts while the previous one is still running (i.e., it takes more than 30 minutes).
Any ideas on how to approach this case?
My pipeline looks like this:

The GCS Move plugin does not support filters, which would have helped here, I guess. There is an existing JIRA - https://cdap.atlassian.net/browse/PLUGIN-698 - to track this.
A workaround is to use the File Move plugin, which has wildcard support.
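For example (hypothetical bucket and file-naming convention), the File Move source path could use a wildcard such as:
gs://my-source-bucket/incoming/sales_2023-01-15_*.csv
so that only the files belonging to the current run's pattern are moved, rather than the whole bucket.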

Related

Optimize Azure Data Factory copy of 10.000+ JSON files from BLOB storage to ADLS G2

Situation: Every day a bunch of JSON files are generated and put into Azure Blob storage. Also every day, an Azure Data Factory copy job does a lookup in the Blob storage and applies a "Filter by last modified":
Start time: @adddays(utcnow(),-2)
End time: @utcnow()
The files are copied to Azure Datalake Gen2.
On normal days, with 50-100 new JSON files, the copy job runs fine, but on the last day of every quarter the number of new JSON files increases to 10,000+ and the copy job fails with the message "ErrorCode=SystemErrorFailToInsertSubJobForTooLargePayload,….."
Therefore I have made a new copy job that uses a ForEach loop to run parallel copy activities. This can copy much larger volumes of files, but it still takes a couple of minutes per file, and I have not seen more than around 500 files per hour being copied, so that is still not fast enough.
Therefore I am searching for more ways to optimize the copy. I have included a couple of screenshots but can give more details on specifics.
The issue is the size of the payload, which cannot be processed with the current configuration (assuming you are using default settings).
You can optimize Copy activity performance by considering the following changes in your Azure Data Factory (ADF) environment:
Data Integration Units
Self-hosted integration runtime scalability
Parallel copy
Staged copy
You can try these performance tuning steps in your ADF to increase performance.
Configure the copy optimization features in the Settings tab.
Refer to the Copy activity performance optimization features documentation for more details and better understanding.
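As a rough illustration of where these knobs live, the Copy activity's typeProperties can carry the DIU, parallel copy, and staged copy settings; the values below are placeholders to tune for your workload, and the source/sink types depend on your datasets:
"typeProperties": {
    "source": { "type": "JsonSource" },
    "sink": { "type": "JsonSink" },
    "dataIntegrationUnits": 32,
    "parallelCopies": 16,
    "enableStaging": true,
    "stagingSettings": {
        "linkedServiceName": { "referenceName": "StagingBlobStorage", "type": "LinkedServiceReference" }
    }
}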

Performance issues when merging large files into one single file

I have a pipeline that contains multiple copy activities, and the main purpose of these activities is to merge multiple files into one single file.
The problem with this pipeline is that it takes about 4 hours to execute (to merge the files). Is there any way to reduce the duration, please?
Thanks for your reply.
If the copy operation is being performed on an Azure integration runtime, the following steps should be followed:
For Data Integration Units (DIU) and parallel copy settings, start with the default values.
If you're using a self-hosted integration runtime, you'll need to do the following:
I would recommend that you run the IR on a separate machine, kept isolated from the data store server. Start with the default values for the parallel copy settings and with the self-hosted IR on a single node.
Otherwise, you can leverage:
Data Integration Units (DIU)
A DIU is a measure that represents the power of a single unit in Azure Data Factory and Synapse pipelines, where power is a combination of CPU, memory, and network resource allocation. DIU only applies to the Azure integration runtime; it does not apply to the self-hosted integration runtime.
Parallel copy
You can set the parallelCopies property to indicate the parallelism you want the copy activity to use. Think of this property as the maximum number of threads within the copy activity; the threads operate in parallel, either reading from your source or writing to your sink data store.
Here is the Microsoft documentation for troubleshooting copy activity performance.
When copying data into Azure Table storage, the default parallel copy is 4. The range of the DIU setting is 2-256; however, the specific behavior of DIU differs between copy scenarios even if you set the number yourself.
Please see the table listed there, especially the part below: DIU has some limitations, as you have seen, so you should choose the optimal setting for your own scenario.
For example, if you are copying 1 GB of data, DIU somehow never goes beyond 4; but if you try to copy 10 GB of data, you will notice DIU scaling up beyond 4.
Here is the list of the Data Integration Units.

Always read latest folder from s3 bucket in spark

Below is what my S3 bucket folder structure looks like:
s3://s3bucket/folder1/morefolders/$folder_which_I_want_to_pick_latest/
$folder_which_I_want_to_pick_latest - This folder can always have an incrementing number for every new folder that comes in, like randomnumber_timestamp
Is there a way I can automate this process by always reading the most recent folder in S3 from Spark in Scala?
The best way to work with that kind of "behavior" is to structure your data with a partitioned approach, like year=2020/month=02/day=12, where every partition is a folder (in the AWS console). That way you can use a simple filter in Spark to determine the latest one, as in the sketch below. (More info: https://www.datio.com/iaas/understanding-the-data-partitioning-technique/)
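A minimal sketch of that filter, shown in PySpark for brevity (the Scala API is analogous); the bucket path, file format, and partition values are hypothetical:
# assumes an active SparkSession `spark` and a year=/month=/day= layout
df = (spark.read
      .parquet("s3a://s3bucket/partitioned-data/")       # partition columns are inferred from the path
      .filter("year = 2020 AND month = 2 AND day = 12"))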
However, if you are not allowed to restructure your bucket, the solution could be costly if you don't have a specific identifier and/or reference that you can use to calculate your newest folder. Remember that in S3 you don't have a concept of folders, only object keys (this is where you see the /, which the AWS console visualizes as folders), so calculating the highest incremental id in $folder_which_I_want_to_pick_latest will eventually mean checking the objects stored in the bucket, and every object request in S3 costs. More info: https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html.
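If you do go that route, a minimal boto3 sketch of a prefix listing might look like the following; the names are hypothetical, it assumes the trailing timestamp part of the folder name sorts chronologically, and a paginator would be needed if there are more than 1000 prefixes:
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="s3bucket",
    Prefix="folder1/morefolders/",
    Delimiter="/",                 # return only the first "folder" level as CommonPrefixes
)
prefixes = [p["Prefix"] for p in resp.get("CommonPrefixes", [])]
# e.g. "folder1/morefolders/4711_20200212T1015/" -> sort on the timestamp after the "_"
latest = max(prefixes, key=lambda p: p.rstrip("/").rsplit("_", 1)[-1])
# latest can then be passed to spark.read via "s3a://s3bucket/" + latest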
Here's one option. Consider writing a Lambda function that either runs on a schedule (say, if you know that your uploads always happen between 1pm and 4pm) or is triggered by an S3 object upload (so it runs for every object uploaded under folder1/morefolders/).
The Lambda would write the relevant part(s) of the S3 object prefix into a simple DynamoDB table. The client that needs to know the latest prefix would read it from DynamoDB.
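A minimal sketch of such a Lambda handler, assuming a hypothetical DynamoDB table named latest-s3-prefix with a string key id:
import boto3

dynamodb = boto3.client("dynamodb")

def handler(event, context):
    for record in event["Records"]:                 # S3 put-notification records
        key = record["s3"]["object"]["key"]         # e.g. folder1/morefolders/4711_20200212T1015/part-0.json
        prefix = key.rsplit("/", 1)[0] + "/"        # keep only the "folder" part of the key
        dynamodb.put_item(
            TableName="latest-s3-prefix",
            Item={
                "id": {"S": "latest"},              # single well-known item the reader looks up
                "prefix": {"S": prefix},
            },
        )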

Use of spark to optimize S3 to S3 transfer

I am learning Spark/Scala and trying to experiment with the below scenario using the Scala language.
Scenario: Copy multiple files from one S3 bucket folder to another S3 bucket folder.
Things done so far:
1) Use AWS S3 SDK and scala:
- Create list of files from S3 source locations.
- Iterate through the list, pass the source and target S3 locations from step 1 and use S3 API copyObject to copy each of these files to the target locations (configured).
This works.
However, I am trying to understand: if I have a large number of files inside multiple folders, is this the most efficient way of doing it, or can I use Spark to parallelize this copying of files?
The approach that I am thinking is:
1) Use the S3 SDK to get the source paths, similar to what's explained above
2) Create an RDD from the list of files using sc.parallelize() - something along these lines?
sc.parallelize(objs.getObjectSummaries.map(_.getKey).toList)
.flatMap { key => Source.fromInputStream(s3.getObject(bucket, key).getObjectContent: InputStream).getLines }
3) Can I use sc.wholeTextFiles in some way to make this work?
I am not sure how to achieve this as of now.
Can you please help me understand if I am thinking in the right direction and also is this approach correct?
Thanks
I don't think AWS made this complicated, though.
We had the same problem; we transferred around 2 TB in close to 10 minutes.
If you want to transfer from one bucket to another, it is better to use the built-in functionality to transfer within S3 itself.
https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
AWS CLI Command Example:
aws s3 sync s3://sourcebucket s3://destinationbucket
If you want to do it programmatically, you can use any of the SDKs to invoke the same type of command. I would avoid reinventing the wheel.
Hope it helps.
I have a code snippet, cloudCp, which uses Spark for a high-performance parallelised upload; it would be similar to do something for copy, where you'd drop down to the AWS library for that operation.
But: you may not need to push the work out to many machines, as each of the PUT/x-copy-source calls may be slow but doesn't use any bandwidth. You could just start a single process with many threads and a large HTTP client pool and run them all in that process. Take the list, sort the largest few first, then shuffle the rest at random to reduce throttling effects, and print out counters to help profile...
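A minimal sketch of that single-process, many-threads idea using boto3 (bucket names, prefix, and thread count are hypothetical); copy_object is a server-side copy, so no object data flows through the machine running it:
import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")

def copy_one(key):
    # server-side copy from source-bucket to destination-bucket
    s3.copy_object(
        Bucket="destination-bucket",
        Key=key,
        CopySource={"Bucket": "source-bucket", "Key": key},
    )

# list all keys under the source prefix
paginator = s3.get_paginator("list_objects_v2")
keys = [obj["Key"]
        for page in paginator.paginate(Bucket="source-bucket", Prefix="folder/")
        for obj in page.get("Contents", [])]

# run the copies on a pool of threads in this one process
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(copy_one, keys))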

Compress files saved in Google cloud storage

Is it possible to compress a file already saved in Google Cloud Storage?
The files are created and populated by Google Dataflow code. Dataflow cannot write to a compressed file, but my requirement is to save them in a compressed format.
Writing to compressed files is not supported on the standard TextIO.Sink because reading from compressed files is less scalable -- the file can't be split across multiple workers without first being decompressed.
If you want to do this (and aren't worried about potential scalability limits) you could look at writing a custom file-based sink that compresses the files. You can look at TextIO for examples and also at the docs on how to write a file-based sink.
The key change from TextIO would be modifying the TextWriteOperation (which extends FileWriteOperation) to support compressed files.
Also, consider filing a feature request against Cloud Dataflow and/or Apache Beam.
Another option could be to change your pipeline slightly.
Instead of having your pipeline write directly to GCS, you could write to one or more tables in BigQuery, and then when your pipeline is finished simply kick off a BigQuery export job to GCS with GZIP compression set (a short sketch follows the links below).
https://cloud.google.com/bigquery/docs/exporting-data
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.extract.compression
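A minimal sketch of that export step with the google-cloud-bigquery Python client; the project, dataset, table, and bucket names are hypothetical:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP

extract_job = client.extract_table(
    "my-project.my_dataset.my_table",           # table the pipeline wrote to
    "gs://my-bucket/exports/output-*.csv.gz",   # wildcard lets BigQuery shard large exports
    job_config=job_config,
)
extract_job.result()                            # block until the export finishes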
You could write an app (perhaps using App Engine or Compute Engine) to do this. You would configure notifications on the bucket so your app is notified when a new object is written; the app then reads the object, compresses it, overwrites the object, and sets the Content-Encoding metadata field. Because object writes are transactional, the compressed form of your object wouldn't become visible until it's complete. Note that if you do this, any apps/services that consume the data would need to be able to handle either compressed or uncompressed formats.
As an alternative, you could change your Dataflow setup so it outputs to a temporary bucket, and set up notifications on that bucket to trigger your compression program, which would then write the compressed version to your production bucket and delete the uncompressed object.
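A minimal sketch of that compress-and-rewrite step with the google-cloud-storage Python client; the bucket and object names are hypothetical, and a production version would react to the bucket notification rather than use a hard-coded name:
import gzip
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-dataflow-output-bucket")

def compress_object(name):
    blob = bucket.blob(name)
    data = blob.download_as_bytes()        # read the uncompressed object
    blob.content_encoding = "gzip"         # mark the payload as gzip-encoded
    blob.upload_from_string(               # overwrite the object with the compressed bytes
        gzip.compress(data),
        content_type="text/plain",
    )

compress_object("output/part-00000-of-00010.txt")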