Pull from and Push to S3 using Perl

Hi everyone! I have what I assume to be a simple problem, but I could use a hand digging in. I have a server that preprocesses data before translation. This is done by a series of Perl scripts developed over a decade ago (but they work!). This virtual server is being lifted into AWS. The change this makes for my scripts is that the locations they pull from and write to will now be S3 buckets.
The workflow is: copy all files from the source location to the local drive, preprocess the data file by file, and, when complete, move the preprocessed files to a final destination.
use File::Copy;    # provides move()

process_file($workingDir, $dirEntry);
final_move();
# archive the original download, then remove the working copy
move("$downloadDir/$dirEntry", "$archiveDir") or die "ERROR: Archive file $downloadDir/$dirEntry -> $archiveDir FAILED $!\n";
unlink("$workingDir/$dirEntry");
So, in this case $dir and $archiveDir are S3 buckets.
Any advice on adapting this is appreciated.
TIA,
VtR

You have a few options.
Use a system like s3fs-fuse to mount your S3 bucket as a local drive. This would presumably require the smallest changes to your existing code.
Use the AWS Command Line Interface to copy your files to your S3 bucket.
Use the Amazon S3 API (through something like Paws) to upload your files to S3; a rough sketch of the last two options is shown below.
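For the last two options, a minimal, untested sketch (the bucket names, keys, region, $working_dir, and $file are placeholders rather than values from the question; the AWS CLI and Paws are assumed to be installed with credentials already configured):

use strict;
use warnings;
use Paws;

my $working_dir = '/data/working';   # placeholder local work area
my $file        = 'file1.dat';       # placeholder file name

# Option 2: shell out to the AWS CLI to pull the source file down
system('aws', 's3', 'cp', "s3://source-bucket/incoming/$file", "$working_dir/$file") == 0
    or die "aws s3 cp download failed: $?";

# ... run the existing preprocessing on $working_dir/$file here ...

# Option 3: use Paws to push the processed file to the destination bucket
my $s3 = Paws->service('S3', region => 'us-east-1');

open my $in, '<', "$working_dir/$file" or die "open $working_dir/$file: $!";
binmode $in;
my $data = do { local $/; <$in> };   # slurp the processed file
close $in;

$s3->PutObject(
    Bucket => 'archive-bucket',
    Key    => "processed/$file",
    Body   => $data,
);

In the original script, a PutObject (or an aws s3 mv) would take the place of the move() to $archiveDir, with unlink() still cleaning up the local working copy.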

Related

Use s3fs only to upload new files, don't care about existing ones already on bucket

I was hoping to use s3fs to upload new files into S3. In the documentation I saw that it doesn't work well when there are multiple clients uploading/syncing to the same bucket.
I really don't care about syncing files from the bucket to my local drive; I only want the opposite: to upload new files to S3 as they are created.
Is there a way to achieve that with s3fs? It wasn't clear from the docs whether that functionality is available through flags.
s3fs does not synchronize files. Instead it intercepts the open, read, write, etc. calls and relays them to the S3 server. Thus it will work for your upload-only use case. Note that s3fs does use some temporary storage to stage the upload.
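For illustration only (the bucket name, mount point, and credentials file below are placeholders), a typical s3fs mount looks something like:
s3fs my-bucket /mnt/my-bucket -o passwd_file=$HOME/.passwd-s3fs
Once the bucket is mounted, ordinary writes into /mnt/my-bucket are relayed to S3, so an upload-only workflow can simply copy or move new files into that path.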

Using PowerShell to upload to AWS S3

Hopefully this is a quick fix (most likely user error). I am using PowerShell to upload to AWS S3. I'm attempting to copy a number of .mp4 files from a folder to an S3 location, and I'm able to copy individual files successfully using the command below:
aws s3 cp .\video1.mp4 s3://bucketname/root/source/
But when I try to copy all the files within that directory I get an error:
aws s3 cp F:\folder1\folder2\folder3\folder4\* s3://bucketname/root/source/
The user-provided path F:\folder1\folder2\folder3\folder4\* does not exist.
I've tried multiple variations on the above: no path, just *, *.mp4, .*.mp4 (coming from a Linux background), with quotation marks, etc., but I can't seem to get it working.
I was using this documentation initially: https://www.tutorialspoint.com/how-to-copy-folder-contents-in-powershell-with-recurse-parameter. I feel the answer is probably very simple, but I couldn't see what I was doing wrong.
Any help would be appreciated.
Thanks.
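For what it's worth, aws s3 cp does not expand wildcards itself; the usual pattern is to copy the directory with --recursive and filter with --exclude/--include (the paths below are the ones from the question):
aws s3 cp F:\folder1\folder2\folder3\folder4\ s3://bucketname/root/source/ --recursive --exclude "*" --include "*.mp4"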

Access Bucket File from Perl in Google Cloud Shell

How does one open a file in a bucket from a Perl program running in a Google Cloud Shell in the same project?
One can upload a file into the shell file system and open it, and also put the file in a bucket for access by others, but that seems counter-productive, never mind that the files will be out of sync a day later.
I've tried various forms of
open($fh, '<', "gs://bucketname/filename");
without any luck.
Mount the bucket into the file system with FUSE.
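A minimal sketch, assuming the gcsfuse tool (Cloud Storage FUSE) is available in the Cloud Shell and using placeholder bucket, mount, and file names:
mkdir -p "$HOME/gcs" && gcsfuse bucketname "$HOME/gcs"
Then, from Perl, the object can be opened like any local file:
open(my $fh, '<', "$ENV{HOME}/gcs/filename") or die "open: $!";
while (my $line = <$fh>) {
    print $line;
}
close $fh;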

Talend: Using tfilelist to access files from a shared network path

I have a Talend job that searches a directory and then uploads it to our database.
It's something like this: dbconnection>twaitforfile>tfilelist>fileschema>tmap>db
I have a subjobok that then commits the data into the table, iterates through the directory, and moves files to another folder.
Recently I was instructed to change the directory to a shared network path, using the same components as before (I originally thought of changing components to tftpfilelist, etc.).
My question is how to direct it to the shared network path. I was able to get it to go through using double backslashes (\\), but it won't read any of the new files arriving.
Thanks!
I suppose that if you use tWaitForFile on the local filesystem, Talend/Java hooks into the folder somehow and gets a notification when a new file is put into it.
Now that you are on a network drive, first of all, this is out of reach of the component; second, the OS behind the network drive could be different.
I understand your job is running all the time, listening. You could change the behaviour by putting a tLoop first, which would check the file system for new files and then proceed. There must be some delta check for how the new files get recognized.
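To illustrate the delta-check idea only (this is not Talend code; the share path and poll interval are placeholders), a polling loop can remember which files it has already seen:

use strict;
use warnings;

my $dir = '//server/share/incoming';    # placeholder UNC path to the share
my %seen;                               # names already handed to the pipeline

while (1) {
    opendir(my $dh, $dir) or die "opendir $dir: $!";
    for my $entry (grep { -f "$dir/$_" } readdir $dh) {
        next if $seen{$entry}++;        # delta check: skip anything seen before
        print "new file: $entry\n";     # hand the file to the rest of the job here
    }
    closedir $dh;
    sleep 30;                           # poll interval
}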

Batch file uploading to cloud storage

Could anyone paste a working request to upload several files to Cloud Storage in a batch? I am really struggling to get it working; there are no examples of file uploads, and I'm really stuck. I could probably work it out if I had a working starting point. I'm starting to go crazy, so any help would be much appreciated.
You can find an example at [1] and consult this other answer at [2] as a reference.
I would suggest using gsutil to copy files, even as an external call from your application (PHP exec() or system()), since this tool is optimised for parallel file transfer (the -m option) and recursive folder copy (the -R option), making it very simple and efficient.
For more help on the gsutil copy command: gsutil help cp
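As an illustration (the local folder and bucket name are placeholders), a parallel, recursive copy looks like:
gsutil -m cp -R /path/to/local/folder gs://your-bucket/some/prefix/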
Links:
[1] - https://cloud.google.com/storage/docs/json_api/v1/how-tos/batch#example
[2] - Batch upload requests to Google Cloud Storage using javascript
Regards
Paolo