Copy multiple .gz files from one GCS bucket to another bucket in Java - google-cloud-storage

I want to copy multiple .gz files from one gcs bucket to another. File name pattern has prefix as 'Logs_' and suffix as date like '20160909',so full file name will be Logs_2016090.gz, Logs_20160908.gz etc. I want to copy all files starting with Logs_ from one gcs bucket to another gcs bucket. For this I am using wildcard character * at the end like Logs_*.gz for copy operation as below:
Storage.Objects.Copy request =
storageService
.objects()
.copy("source_bucket", "Logs_*.gz", "destination_bucket", ".", content);
Above I am using "." because all files has to be copied to destination_bucket, so I can't specify single file name there. Unfortunately, this code doesn't work and error that file doesn't exist. I am not sure what change is required here. Any java link or any piece of code will be helpful. Thanks !!

While the gsutil command-line utility happily supports wildcards, the GCS APIs themselves are lower level commands and do not. The storage.objects.copy method must have one precise source and one precise destination.
I recommend one of the following:
Use a small script invoking gsutil, or
Make a storage.objects.list call to get the names of all matching source objects, then iterate over them, calling copy for each, or
If you're dealing with more than, say, 10 TB or so of gzip files, consider using Google's Cloud Storage Transfer Service to copy the files.

Related

how to create a script that allows to use the path list as a reference for copying files in PowerShell in .bat script

I'm looking for a way to automate archiving where after I plug my two external drives I can copy all my resources. The problem is that I have different file structures on my laptop and on both external drives so I need to select specific folders to be copied. It means that I can't select one root folder and copy it straightforward. I tried to find a way to declare more than one path in the cp command and in the copy command, without success. An example path:
/my_programming_stuff
/folder1
/folder2
/folder3
/folder4
I want to select only the first 3 folders to copy them into external drive1 and external drive 2. The idea is to create a .bat file that will copy everything at once ( in the best case scenario it will be copied simultaneously on both external drives, so it will be much faster). Another problem is that there needs to be a bypass the ntfs long path limitations (max. 260 characters).
Flags that I want to use:
Copy the files and directories and all of their attributes,
including ownerships and permissions.
Recursively copy directories and their contents.
When copying files from one directory to another, only
copy files that either doesn't exist or are newer than the
existing corresponding files, in the destination
directory.
data verification (so it's certain that the copy was verified)
progression bar with time eta
Until now I was using Total Commander to do this but every day I need to pick only a few folders to be copied which takes time and is inefficient.
I have experience with Bash and PowerShell but I am not sure how to handle this topic.
Create a static batch file with robocopy commands. I think /copyall is the only switch you need to specify for all this. Other defaults should satisfy requirements.
https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/robocopy
I think your time will be better spent learning how to use either FastCopy or FreeFileSynce. I used FreeFileSync some years ago but got disgusted with the it's constantly changing format of its xml file used for starting a backup, so I switched to FastCopy. But it looks like FreeFileSync may be getting their act together and I aim to do some experiments over the summer to see if I want to switch back to it.
Both can handle the long filename format issues, both can be executed by a batch file, both seem to have a lot of quality, but FreeFileSync has more features - and more bloated because of the features. But speed wise, I think FastCopy is probably one of the better products out there and very streamline in use and design.

How to copy the files from SFTP to target folder created dynamically based on source filename (Blob storage)?

I am new to ADF, need help for 2 scenarios
1.I have to copy files from SFTP to blob storage(Azure Gnen2) using ADF. In the source SFTP folder, there are 3- 5 different the files. For example
S09353.DB2K.AFC00R46.F201130.txt
S09353.DB2K.XYZ00R46.F201130.txt
S09353.DB2K.GLY00R46.F201130.txt
On copying, this files are copied and placed under corresponding folders which are created dynamically based on file types.
For example: S09353.DB2K.AFC00R46.F201130.txt copy to AFC00R46 folder
S09353.DB2K.XYZ00R46.F201130.txt copy to XYZ00R46 folder.
2.Another requirement is need to copy csv files from blob storage to SFTP. On coping, the files need to copy to target folder created dynamically based on file name:
for example: cust-fin.csv----->copy to--------->Finance folder
please help me on this
The basic solution to your problem is to use Parameters in DataSets. This example is for a Blob Storage connection, but the approach is the same for SFTP as well. Another hint: if you are just moving files, use Binary DataSets.
Create Parameter(s) in DataSet
Reference Parameter(s) in DataSet
Supply Parameter(s) in the Pipeline
In this example, I am passing Pipeline parameters to a GetMetadata Activity, but the principles are the same for all DataSet types. The values could also be hard coded, expressions, or variables.
Your Process
If you need this to be dynamic for each file name, then you'll probably want to break this into parts:
Use a GetMetadata Activity to list the files from SFTP.
Foreach over the return list and process each file individually.
Inside the Foreach -> Parse each file name individually to extract the Folder name to a variable.
Inside the Foreach -> Use the Variable in a Copy Activity to populate the Folder name in the DataSet.

TextIO. Read multiple files from GCS using pattern {}

I tried using the following
TextIO.Read.from("gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv")
That pattern didn't work, as I get
java.lang.IllegalStateException: Unable to find any files matching StaticValueProvider{value=gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv}
Even though those 2 files do exist. And I tried with a local file using a similar expression
TextIO.Read.from("somefolder/xxx_{2017-06-06,2017-06-06}.csv")
And that did work just fine.
I would've thought there would be support for all kinds of globs for files in GCS, but nope. Why is that? is there away to accomplish what I'm looking for?
This may be another option, in addition to Scott's suggestion and your comment on his answer:
You can define a list with the paths you want to read and then iterate over it, creating a number of PCollections in the usual way:
PCollection<String> events1 = p.apply(TextIO.Read.from(path1));
PCollection<String> events2 = p.apply(TextIO.Read.from(path2));
Then create a PCollectionList:
PCollectionList<String> eventsList = PCollectionList.of(events1).and(events2);
And then flatten this list into your PCollection for your main input:
PCollection<String> events = eventsList.apply(Flatten.pCollections());
Glob patterns work slightly differently in Google Cloud Storage vs. the local filesystem. Apache Beam's TextIO.Read transform will defer to the underlying filesystem to interpret the glob.
GCS glob wildcard patterns are documented here (Wildcard Names).
In the case above, you could use:
TextIO.Read.from("gs://xyz.abc/xxx_2017-06-*.csv")
Note however that this will also include any other matching files.
Did you try Apache Beam TextIO.Read from function? Here, it says that it is possible with gcs as well:
public TextIO.Read from(java.lang.String filepattern)
Reads text files that reads from the file(s) with the given filename or filename pattern. This can be a local path (if running locally), or a Google Cloud Storage filename or filename pattern of the form "gs://<bucket>/<filepath>" (if running locally or using remote execution service).
Standard Java Filesystem glob patterns ("*", "?", "[..]") are supported.

Get an WARC achive file with all files from a given domain, using from commoncrawl.org

Commoncrawl datasets are splitted by segments.
How to extract a subset of the common-crawl data-set? I need a WARC archive file (or several archive files) with all the files from a given domain, such as example.com?
Note: common_crawl_index allows to do that by running bin/remote_copy copy "com.ipc.www" --bucket commoncrawl_sample --key common_crawl/ipc_crawl, but the project is outdated: it only works for 2012 datasets, and it does not accept WARC, WAT or WET files.
Note: Also, http://index.commoncrawl.org/ allows to find the segments for a given url prefix, but there is not a utility to download only that pages, such as the previous remote_copy command.
PS: I am aware I can implement a program to do so. Here I am asking if common-crawl (or someone else) already thought and implemented this feature.

Talend writing file names from a list to a file for process completion

My job has following steps:
- Connect to ftp location
- Download compressed files
- Uncompress files to different folder
- Delete compressed files
- Write file names to a tracking file
ftpConnection -OnComponentOk--> ftpList-Iterate--> ftpGet -Iterate--> fileList-Iterate--> fileUnarchive-Iterate--> fileDelete
Question is where can i write the uncompressed filenames to the tracking file. When i try to Iterate from fileUnarchive to fileOutputDelimited it does not
allow me, similarly if i want to add a map from fileDelete it does not allow me. Do i need a map or can i make use of the global variable somehow?
One way i can do it getting it after ftpGet but i would prefer to do it at a latter stage (after unarchiving or deletion) so i don't update the file if the
process fails at one of these steps.
Thanks.
try with tfiledelete-->oncomponentok-->tfixedflowinput(here you can use the same global variable which contains current file name from tfilelist)-->(mainflow)-=->tfileoutputdelimeted...