TextIO.Read: read multiple files from GCS using a {} pattern

I tried using the following
TextIO.Read.from("gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv")
That pattern didn't work, as I get
java.lang.IllegalStateException: Unable to find any files matching StaticValueProvider{value=gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv}
Those two files do exist, though. I then tried a similar expression with a local file:
TextIO.Read.from("somefolder/xxx_{2017-06-06,2017-06-06}.csv")
And that did work just fine.
I would have thought all kinds of globs would be supported for files in GCS, but apparently not. Why is that? Is there a way to accomplish what I'm looking for?

This may be another option, in addition to Scott's suggestion and your comment on his answer:
You can define a list with the paths you want to read and then iterate over it, creating a number of PCollections in the usual way:
PCollection<String> events1 = p.apply(TextIO.Read.from(path1));
PCollection<String> events2 = p.apply(TextIO.Read.from(path2));
Then create a PCollectionList:
PCollectionList<String> eventsList = PCollectionList.of(events1).and(events2);
And then flatten this list into your PCollection for your main input:
PCollection<String> events = eventsList.apply(Flatten.pCollections());
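Putting the pieces together as a loop, here is a minimal sketch. It assumes the same pipeline p and the pre-2.0 TextIO.Read style used above (with Beam 2.x you would use TextIO.read().from(...) instead), and the list of paths is hypothetical:
// A hypothetical list of input paths (the dates here are made up).
List<String> paths = Arrays.asList(
    "gs://xyz.abc/xxx_2017-06-06.csv",
    "gs://xyz.abc/xxx_2017-06-07.csv");

// Read each path into its own PCollection and accumulate them in a PCollectionList.
PCollectionList<String> eventsList = PCollectionList.empty(p);
for (String path : paths) {
  // Naming each apply avoids duplicate-transform-name warnings.
  eventsList = eventsList.and(p.apply("Read " + path, TextIO.Read.from(path)));
}

// Flatten everything into a single PCollection for the main input.
PCollection<String> events = eventsList.apply(Flatten.pCollections());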

Glob patterns work slightly differently in Google Cloud Storage vs. the local filesystem. Apache Beam's TextIO.Read transform will defer to the underlying filesystem to interpret the glob.
GCS glob wildcard patterns are documented in the Cloud Storage documentation under "Wildcard Names".
In the case above, you could use:
TextIO.Read.from("gs://xyz.abc/xxx_2017-06-*.csv")
Note however that this will also include any other matching files.

Did you try Apache Beam's TextIO.Read.from function? The documentation says that it works with GCS as well:
public TextIO.Read from(java.lang.String filepattern)
Reads text files from the file(s) with the given filename or filename pattern. This can be a local path (if running locally), or a Google Cloud Storage filename or filename pattern of the form "gs://<bucket>/<filepath>" (if running locally or using a remote execution service).
Standard Java Filesystem glob patterns ("*", "?", "[..]") are supported.
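If the two dates differ only in their last digit, the character class mentioned above can match exactly those files without pulling in the rest of the month. A minimal sketch, assuming hypothetical dates 2017-06-06 and 2017-06-07 and that the GCS filesystem honors the [..] syntax as the documentation describes:
// Hypothetical dates; [67] matches a single character that is either 6 or 7.
PCollection<String> lines =
    p.apply(TextIO.Read.from("gs://xyz.abc/xxx_2017-06-0[67].csv"));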

Related

What is the usage of blacklist.txt in pythonforandroid (p4a)?

In the documentation of pythonforandroid, at https://python-for-android.readthedocs.io/en/latest/buildoptions/, there is a build option described called blacklist.
--blacklist: The path to a file containing blacklisted patterns that will be excluded from the final APK. Defaults to ./blacklist.txt
However, not a word can be found anywhere about how to use this file and what exactly the patterns are supposed to represent. For instance, is this used to exclude libraries, files, or directories? Do the patterns match file names or contents? What is the syntax of the patterns, or an example of a valid blacklist.txt file?
This file should contain a list of glob patterns, i.e. as implemented by Python's fnmatch, one per line. These patterns are compared against the full filepath of each file in your source dir, probably the absolute filepath, but I'm not certain about that (it might be relative to the source dir).
For instance, the file could contain the following lines:
*.txt
*/test.jpg
This would prevent all files ending with .txt from being included in the apk, and all files named test.jpg in any subfolder.
If using buildozer, the android.blacklist_src buildozer.spec option can be used to point to your choice of blacklist file.

How to use wildcards in file names in Azure Data Factory V2 to copy only specific files from a container?

I have a pipeline in Azure Data Factory V2; in that pipeline I have defined a Copy activity to copy files from Blob Storage to Azure Data Lake Store. But I want to copy all the files except those that contain the string "-2192-" in their names.
So If I have these files:
213-1000-aasd.csv
343-2000-aasd.csv
343-3000-aasd.csv
213-2192-aasd.csv
I want to copy all of them with the Copy activity, but not 213-2192-aasd.csv. I have tried different regular expressions in the wildcard option, but with no success.
To my knowledge, the regular expression should be:
[^-2192-].csv
But it gives errors on this.
Thanks.
I don't know whether the Data Factory expression language supports regex. Assuming it does not, the wildcard option probably supports positive matching only, so using a wildcard to exclude specific patterns seems unlikely to work.
What you could do is 1) use a Get Metadata activity to get the list of objects in the blob folder, then 2) a Filter activity where item().type is 'File' and the index of '-2192-' in the file name is < 0 (indexes are 0-based, so a negative index means the substring is not present), and finally 3) a ForEach over the Filter output that contains the Copy activity.

Copy multiple .gz files from one GCS bucket to another bucket in Java

I want to copy multiple .gz files from one GCS bucket to another. The file names have the prefix 'Logs_' and a date suffix like '20160909', so the full file names are Logs_20160909.gz, Logs_20160908.gz, etc. I want to copy all files starting with Logs_ from one GCS bucket to another GCS bucket. For this I am using the wildcard character * at the end, as in Logs_*.gz, in the copy operation below:
Storage.Objects.Copy request =
    storageService
        .objects()
        .copy("source_bucket", "Logs_*.gz", "destination_bucket", ".", content);
Above I am using "." because all the files have to be copied to destination_bucket, so I can't specify a single file name there. Unfortunately, this code doesn't work; it fails with an error saying the file doesn't exist. I am not sure what change is required here. Any Java link or piece of code would be helpful. Thanks!
While the gsutil command-line utility happily supports wildcards, the GCS APIs themselves are lower-level operations and do not. The storage.objects.copy method must have one precise source and one precise destination.
I recommend one of the following:
Use a small script invoking gsutil, or
Make a storage.objects.list call to get the names of all matching source objects, then iterate over them, calling copy for each (see the sketch after this list), or
If you're dealing with more than, say, 10 TB or so of gzip files, consider using Google's Cloud Storage Transfer Service to copy the files.
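For the second option, a rough sketch using the same JSON API client (storageService), Objects, and StorageObject model classes as in the question; the bucket names are the question's placeholders, error handling is omitted, and paging is included because list results can span multiple pages:
String sourceBucket = "source_bucket";
String destinationBucket = "destination_bucket";
String pageToken = null;
do {
  // List objects whose names start with "Logs_".
  Storage.Objects.List list = storageService.objects()
      .list(sourceBucket)
      .setPrefix("Logs_");
  if (pageToken != null) {
    list.setPageToken(pageToken);
  }
  Objects listing = list.execute();
  if (listing.getItems() != null) {
    for (StorageObject obj : listing.getItems()) {
      // Copy each object under the same name; the empty StorageObject body
      // leaves the destination metadata to be taken from the source object.
      storageService.objects()
          .copy(sourceBucket, obj.getName(), destinationBucket, obj.getName(),
              new StorageObject())
          .execute();
    }
  }
  pageToken = listing.getNextPageToken();
} while (pageToken != null);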

Finding individual filenames when loading multiple files in Apache Spark

I have an Apache Spark job that loads multiple files for processing using
val inputFile = sc.textFile(inputPath)
This is working fine. However for auditing purposes it would be useful to track which line came from which file when the inputPath is a wildcard. Something like an RDD[(String, String)] where the first string is the line of input text and the second is the filename.
Specifically, I'm using Google's Dataproc and the file(s) are located in Google Cloud Storage, so the paths are similar to 'gs://my_data/*.txt'.
Check out SparkContext#wholeTextFiles.
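For example, a rough sketch of the (line, filename) pairing described in the question (variable names are made up; note that wholeTextFiles reads each file as a single record, so it works best with reasonably small files):
// Yields (filename, fileContent) pairs for every file matching the glob.
val files = sc.wholeTextFiles("gs://my_data/*.txt")

// Expand each file into (line, filename) pairs: an RDD[(String, String)].
val linesWithFile = files.flatMap { case (fileName, content) =>
  content.split("\n").map(line => (line, fileName))
}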
Sometimes, if you use many input paths, you may find yourself wanting to use worker resources for this file listing. To do that, you can use the Scala parts of this answer on improving the performance of wholeTextFiles in PySpark (ignore all the Python bits; the Scala parts are what matter).

Why does tup need one Tupfile per directory?

I've read a lot about the tup build system.
In many places, it is said that tup "does not support recursive rules" and that you need to have one Tupfile per directory. Yet I have not seen an official statement or explanation.
Is the above claim correct?
If yes, why, and for which kind of task is this problematic? An example would be nice.
It is worth noting that a Tupfile can now create files in a different directory. You could always read files from a different directory, so you can now have a single Tupfile for the whole project.
Some more info here: https://groups.google.com/d/msg/tup-users/h7B1YzdgCag/qsOpBs4DIQ8J (a little outdated) and https://groups.google.com/d/msg/tup-users/-w932WkPBkw/7ckmHJ9WUCEJ (new syntax to use a group as input).
If you use the new Lua parser you can also have a "default" Tupfile; see http://gittup.org/tup/lua_parser.html and the information about Tupdefault.lua.
Some of the answers have already mentioned that the limitation is really one Tupfile per directory where you want output files, rather than one Tupfile per directory. In a recent commit this limitation has been relaxed, and tup now also allows you to place output files in subdirectories of the Tupfile.
In addition, with variants, it is possible to generate output files anywhere in the build tree.
The official statement can be found in tup manual: http://gittup.org/tup/manual.html
You must create a file called "Tupfile" anywhere in the tup hierarchy
that you want to create an output file based on the input files. The
input files can be anywhere else in the tup hierarchy, but the output
file(s) must be written in the same directory as the Tupfile.
(The quotation is the first paragraph of the TUPFILES section in the manual.)
AFAIK, it is a limitation that has something to do with the way tup stores dependencies in the .tup subdirectory, but I don't know the details.