Use of Spark to optimize S3-to-S3 transfer - Scala

I am learning Spark/Scala and trying to experiment with the below scenario using Scala.
Scenario: Copy multiple files from one S3 bucket folder to another S3 bucket folder.
Things done so far:
1) Use the AWS S3 SDK and Scala:
- Create a list of files from the S3 source locations.
- Iterate through the list, pass the source and target S3 locations from step 1, and use the S3 copyObject API to copy each of these files to the configured target location.
This works.
However, I am trying to understand: if I have a large number of files across multiple folders, is this the most efficient way of doing it, or can I use Spark to parallelize the copy of these files?
The approach that I am thinking is:
1) Use S3 SDK to get the source paths similar to what's explained above
2) Create an RDD from the list of file keys using sc.parallelize() - something along these lines?
sc.parallelize(objs.getObjectSummaries.map(_.getKey).toList)
.flatMap { key => Source.fromInputStream(s3.getObject(bucket, key).getObjectContent: InputStream).getLines }
3) Can I use sc.wholeTextFiles in some way to make this work?
I am not sure how to achieve this as of now.
Can you please help me understand whether I am thinking in the right direction, and whether this approach is correct?
Thanks

I don't think AWS has made this complicated, though.
We had the same problem; we transferred around 2 TB in close to 10 minutes.
If you want to transfer from one bucket to another, it is better to use the built-in functionality to transfer within S3 itself.
https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
AWS CLI Command Example:
aws s3 sync s3://sourcebucket s3://destinationbucket
If you want to do it programmatically, you can use any of the SDKs to invoke the same type of command. I would avoid reinventing the wheel.
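For example, a rough sketch of a server-side copy of a single object using the AWS SDK for Java v1 TransferManager from Scala (bucket and key names here are placeholders):

import com.amazonaws.services.s3.transfer.TransferManagerBuilder

val tm = TransferManagerBuilder.standard().build()
// The copy is executed server-side by S3; no object data flows through this process
val copy = tm.copy("sourcebucket", "folder/myfile.csv", "destinationbucket", "folder/myfile.csv")
copy.waitForCompletion()
tm.shutdownNow()

TransferManager also takes care of multipart copies for large objects, which is one more wheel you don't have to reinvent.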
Hope it helps.

I have a code snippet, cloudCp, which uses Spark for a high-performance parallelised upload; it'd be similar to do something for copy, where you'd drop down to the AWS library for that operation.
But: you may not need to push the work out to many machines, as each of the PUT/x-copy-source calls may be slow, but it doesn't use any bandwidth. You could just start a process with many, many threads and a large HTTP client pool and run them all in that process. Take the list, sort with the largest few first, and then shuffle the rest at random to reduce throttling effects. Print out counters to help profile...
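As a rough sketch of that single-process approach in Scala (bucket names, prefix, pool size and the top-10 split are illustrative assumptions; listing pagination is omitted):

import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicLong
import scala.collection.JavaConverters._
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import scala.util.Random
import com.amazonaws.services.s3.AmazonS3ClientBuilder

val s3 = AmazonS3ClientBuilder.defaultClient()
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(64))  // many threads, one process

// single listing page for brevity; real code would paginate
val summaries = s3.listObjectsV2("source-bucket", "some/prefix/").getObjectSummaries.asScala.toList
val (largest, rest) = summaries.sortBy(-_.getSize).splitAt(10)
val ordered = largest ++ Random.shuffle(rest)        // largest few first, rest shuffled

val done = new AtomicLong(0)
val copies = ordered.map { s =>
  Future {
    // x-amz-copy-source copy: S3 does the work, no bandwidth used here
    s3.copyObject("source-bucket", s.getKey, "dest-bucket", s.getKey)
    val n = done.incrementAndGet()
    if (n % 100 == 0) println(s"copied $n objects")  // crude progress counter
  }
}
Await.result(Future.sequence(copies), Duration.Inf)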

Related

gcloud run that requires a large database

I hope this is the right place to ask this. What I want to do is perform a sequence search against a large database and create an API for this. I expect that this service will be accessed VERY rarely, so I thought about gcloud run, because it only bills me per use (and I don't use it a lot). I already have a Docker container configured that does what I expect it to; however, I have an issue with the data that's required. I need a database that's roughly 100 GB large. Is there a way to access this in gcloud run?
What would be the optimal way for me to get there? I think downloading 100 GB of data every time a request is made is a waste. Maybe I could fetch a zip file from a storage bucket and inflate it in the run instance? But I am not sure if there is even that much space available.
Thank you
I believe the simpler way to do this is to take the weight off Cloud Run's shoulders.
I'm assuming it is some sort of structured data (JSON, CSV, etc.) - if it really is, it is simpler to import this data into BigQuery and have your Cloud Run service query against BQ.
This way your API will respond much faster, you will save the cost of running Cloud Run with very large instances just to load part of those 100 GB into memory, and you will also separate your architecture into layers (i.e. an application layer and a data layer).
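To make that concrete, a minimal, hedged sketch of the query side using the Google Cloud BigQuery Java client from Scala (project, dataset, table and column names are hypothetical):

import com.google.cloud.bigquery.{BigQueryOptions, QueryJobConfiguration, QueryParameterValue}
import scala.collection.JavaConverters._

val bigquery = BigQueryOptions.getDefaultInstance.getService
val queryConfig = QueryJobConfiguration
  .newBuilder("SELECT sequence_id, score FROM `my-project.my_dataset.sequences` WHERE sequence_id = @id")
  .addNamedParameter("id", QueryParameterValue.string("ABC123"))
  .setUseLegacySql(false)
  .build()

// The Cloud Run container stays small: no local data, just a query per request
val result = bigquery.query(queryConfig)
result.iterateAll().asScala.foreach { row =>
  println(s"${row.get("sequence_id").getStringValue} -> ${row.get("score").getDoubleValue}")
}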

Merging files within the Azure Data Lake container

Consider the below scenario:
I want my data flow to be like the following:
import container --> databricks (Transform) --> export container
Current situation after I am done with the transformation process:
container:
---import
   --folder
      --mydata.csv
---export
   --folder
      --part-1-transformed-mydata.csv
      --part-2-transformed-mydata.csv
      --part-3-transformed-mydata.csv
      --initial.txt
      --success.txt
      --finish.txt
I want the below structure:
---import
   --folder
      --mydata.csv
---export
   --folder
      --transformed-mydata.csv
What should be the preferred way (considering the data is a few GB, < 10) within Databricks? I am also happy to use any functionality in Data Factory, as I am using this Databricks notebook as a step in a pipeline.
Note: I am using Apache Spark 3.0.0, Scala 2.12 in Databricks with 14 GB memory, 4 cores. Cluster type is Standard.
You will either need to repartition the data into a single partition (sketched below; note this defeats the point of using a distributed computing platform),
or, after the files are generated, simply run a command to concatenate them all into a single file. This might be problematic if each file has a header; you will need to account for that in your logic.
It might be better to think of the export folder as the "file", if that makes sense. It doesn't solve your problem, but unless you need to produce a single file for some reason, most consumers won't have an issue reading the data in a directory.
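A rough sketch of the single-partition option, assuming the containers are mounted under /mnt and that df and spark come from the notebook context (paths and names are illustrative):

import org.apache.hadoop.fs.{FileSystem, Path}

val tmpDir = "/mnt/export/folder/_tmp_transformed"

df.coalesce(1)                              // one partition => exactly one part file
  .write
  .option("header", "true")
  .mode("overwrite")
  .csv(tmpDir)

// Rename the single part-*.csv to the name downstream steps expect,
// then clean up the temporary directory and its commit markers.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val partFile = fs.globStatus(new Path(s"$tmpDir/part-*.csv"))(0).getPath
fs.rename(partFile, new Path("/mnt/export/folder/transformed-mydata.csv"))
fs.delete(new Path(tmpDir), true)

If you'd rather keep the notebook output partitioned, an alternative is a Data Factory Copy activity with the merge-files copy behaviour as a later pipeline step.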

Always read latest folder from s3 bucket in spark

Below is what my S3 bucket folder structure looks like:
s3://s3bucket/folder1/morefolders/$folder_which_I_want_to_pick_latest/
$folder_which_I_want_to_pick_latest - This folder can always have an incrementing number for every new folder that comes in, like randomnumber_timestamp
Is there a way I can automate this process by always reading the most recent folder in S3 from Spark in Scala?
The best way to work with that kind of "behavior" is to structure your data with a partitioned approach, like year=2020/month=02/day=12, where every partition is a folder (in the AWS console). That way you can use a simple filter in Spark to determine the latest one. (more info: https://www.datio.com/iaas/understanding-the-data-partitioning-technique/)
However, if you are not allowed to re-structure your bucket, the solution could be costly if you don't have a specific identifier and/or reference that you can use to calculate your newest folder. Remember that in S3 you don't have a concept of folders, only object keys (this is where you see the / and why the AWS console can visualize them as folders), so calculating the highest incremental id in $folder_which_I_want_to_pick_latest will eventually mean checking all the objects stored under that prefix, and every object request in S3 costs. More info: https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html.
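If the existing layout has to stay, a rough sketch of picking the newest prefix with a delimiter listing (bucket, prefix, file format and the timestamp-suffix assumption are illustrative; pagination is omitted):

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ListObjectsV2Request
import scala.collection.JavaConverters._

val s3 = AmazonS3ClientBuilder.defaultClient()
val req = new ListObjectsV2Request()
  .withBucketName("s3bucket")
  .withPrefix("folder1/morefolders/")
  .withDelimiter("/")                          // list only the next "folder" level

// common prefixes look like folder1/morefolders/<randomnumber_timestamp>/
val latest = s3.listObjectsV2(req).getCommonPrefixes.asScala
  .maxBy(_.stripSuffix("/").split('_').last)   // assumes the part after '_' is a sortable timestamp
val df = spark.read.parquet(s"s3a://s3bucket/$latest")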
Here's one option. Consider writing a Lambda function that either runs on a schedule (say if you knew that your uploads always happen between 1pm and 4pm) or is triggered by an S3 object upload (so it happens for every object uploaded to folder1/morefolders/).
The Lambda would write the relevant part(s) of the S3 object prefix into a simple DynamoDB table. The client that needs to know the latest prefix would read it from DynamoDB.

DynamoDB vs ElasticSearch vs S3 - which service to use for superfast get/put 10-20MB files?

I have a backend that receives, stores, and serves 10-20 MB JSON files. Which service should I use for superfast put and get (I cannot break the file into smaller chunks)? I don't have to run queries on these files, just get them, store them, and serve them instantly. The service should scale to tens of thousands of files easily. Ideally I should be able to put the file in 1-2 seconds and retrieve it in the same time.
I feel S3 is the best option and Elasticsearch the second best option. DynamoDB doesn't allow such object sizes. What should I use? Also, is there any other service? MongoDB is a possible solution but I don't see that on AWS, so something quick to set up would be great.
Thanks
I don't think you should go for Dynamo or ES for this kind of operation.
After all, what you want is to store and serve the files, not to dig into their content, which is what both Dynamo and ES would waste time doing.
My suggestion is to use AWS Lambda + S3 to optimize for cost
S3 does have a small delay after a put until the file is available, though (it gets bigger, even minutes, when you have millions of objects in a bucket).
If that delay matters for your operation and total throughput at any given moment is not too huge, you can create a server (preferably EC2) that serves as a temporary file stash, roughly as sketched after these steps. It will:
Receive your file
Try to upload it to S3
If the file is requested before it's available on S3, serve the file on disk
If the file has been successfully uploaded to S3, serve the S3 URL and delete the file on disk
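A hedged sketch of the read path in Scala (bucket name, stash directory and the serve* stubs are placeholders for whatever web framework you use):

import java.io.File
import com.amazonaws.services.s3.AmazonS3ClientBuilder

val s3 = AmazonS3ClientBuilder.defaultClient()
val bucket = "my-json-bucket"             // placeholder
val stashDir = "/var/stash"               // placeholder

// Stubs standing in for the real HTTP responses
def serveS3Url(key: String): Unit    = println(s"redirect to s3://$bucket/$key")
def serveFromDisk(f: File): Unit     = println(s"stream ${f.getPath} from disk")
def serveNotFound(key: String): Unit = println(s"404 for $key")

def handleGet(key: String): Unit = {
  val local = new File(s"$stashDir/$key")
  if (s3.doesObjectExist(bucket, key)) {
    serveS3Url(key)                       // S3 copy is visible: hand out the S3 URL
    if (local.exists()) local.delete()    // stash copy no longer needed
  } else if (local.exists()) {
    serveFromDisk(local)                  // not on S3 yet: serve the stashed file
  } else {
    serveNotFound(key)
  }
}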

How to automatically edit over 100k files on GCS using Dataflow?

I have over 100 thousand files on Google Cloud Storage that contain JSON objects, and I'd like to create a mirror maintaining the filesystem structure, but with some fields removed from the content of the files.
I tried to use Apache Beam on Google Cloud Dataflow, but it splits all files and I can't maintain the structure anymore. I'm using TextIO.
The structure I have is something like reports/YYYY/MM/DD/<filename>
But Dataflow outputs to output_dir/records-*-of-*.
How can I make Dataflow not split the files and output them with the same directory and file structure?
Alternatively, is there a better system to do this kind of edit on a large number of files?
You cannot use TextIO directly for this, but Beam 2.2.0 will include a feature that will help you write this pipeline yourself.
If you can build a snapshot of Beam at HEAD, you can already use this feature. Note: the API may change slightly between the time of writing this answer and the release of Beam 2.2.0
Use Match.filepatterns() to create a PCollection<Metadata> of files matching the filepattern
Map the PCollection<Metadata> with a ParDo that does what you want to each file using FileSystems:
Use the FileSystems.open() API to read the input file and then standard Java utilities for working with ReadableByteChannel.
Use FileSystems.create() API to write the output file.
Note that Match is a very simple PTransform (that uses FileSystems under the hood) and another way you can use it in your project is by just copy-pasting (the necessary parts of) its code into your project, or studying its code and reimplementing something similar. This can be an option in case you're hesitant to update your Beam SDK version.
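As a rough sketch of the ParDo from step 2, written against the Beam Java SDK from Scala (the input/output roots and the stripFields placeholder are assumptions, not part of the original answer):

import java.nio.channels.Channels
import java.nio.charset.StandardCharsets
import org.apache.beam.sdk.io.FileSystems
import org.apache.beam.sdk.io.fs.MatchResult
import org.apache.beam.sdk.transforms.DoFn
import org.apache.beam.sdk.transforms.DoFn.ProcessElement

class EditFileFn(inputRoot: String, outputRoot: String)
    extends DoFn[MatchResult.Metadata, String] {

  @ProcessElement
  def process(c: ProcessContext): Unit = {
    val inResource = c.element().resourceId()

    // Read the whole input file via FileSystems.open()
    val in = Channels.newInputStream(FileSystems.open(inResource))
    val content = scala.io.Source.fromInputStream(in, "UTF-8").mkString
    in.close()

    val edited = stripFields(content)   // placeholder for the field-removal logic

    // Mirror the original path (e.g. reports/YYYY/MM/DD/<filename>) under outputRoot
    val relativePath = inResource.toString.stripPrefix(inputRoot)
    val outResource = FileSystems.matchNewResource(outputRoot + relativePath, false)
    val out = Channels.newOutputStream(FileSystems.create(outResource, "application/json"))
    out.write(edited.getBytes(StandardCharsets.UTF_8))
    out.close()

    c.output(outResource.toString)
  }

  private def stripFields(json: String): String = json   // placeholder
}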