How to use wildcards in filenames in AzureDataFactoryV2 to copy only specific files from a container?

So I have a pipeline in AzureDataFactoryV2, and in that pipeline I have defined a Copy activity to copy files from Blob storage to Azure Data Lake Store. But I want to copy all the files except those that contain the string "-2192-" in their name.
So if I have these files:
213-1000-aasd.csv
343-2000-aasd.csv
343-3000-aasd.csv
213-2192-aasd.csv
I want to copy all of them with the Copy activity except 213-2192-aasd.csv. I have tried different regular expressions in the wildcard option, but with no success.
To my knowledge the regular expression should be:
[^-2192-].csv
But it gives errors on this.
Thanks.

I don't know whether the Data Factory expression language supports regex. Assuming it does not, the wildcard option is probably positive matching only, so using a wildcard to exclude specific patterns seems unlikely.
What you could do is use 1) Get Metadata to get the list of objects in the blob folder, then 2) a Filter where item().type is 'File' and the index of '-2192-' in the file name is < 0 [the indexes are 0-based], and finally 3) a ForEach over the Filter output that contains the Copy activity.
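A minimal sketch of what the Filter activity settings could look like, assuming the Get Metadata activity is named 'GetFileList' and returns childItems (the activity name is illustrative, not from the original pipeline):
Items:     @activity('GetFileList').output.childItems
Condition: @and(equals(item().type, 'File'), less(indexOf(item().name, '-2192-'), 0))
The ForEach would then iterate over the Filter activity's output, and the Copy activity inside it would use @item().name as the source file name.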

Related

How to set the file name for a GCS Move using runtime arguments with wildcards in a CDAP/Data Fusion pipeline

I need to move some files between buckets after they are processed in the pipeline; however, I have come across files that contain characters like "+" or "-" in their name (example: data+1+2132121.json).
Following the documentation in this response: How to set runtime arguments in a CDAP/DATA FUSION pipeline?
I have set the source path to "gs://mybucket/data/${input}" so that the value of my argument can be "data+1+" and only the files that begin with "data+1+" are moved from the location. However, when using wildcards I cannot move the files. Any idea why this may be happening?

How to avoid copying data from subfolders using COPY INTO in Snowflake

We are trying to load data from an S3 bucket into Snowflake using COPY INTO.
It works perfectly, but data in subfolders is also being copied, and this should not happen.
The following hardcoded regex pattern works perfectly:
copy into TARGETTABLE
from @SOURCESTAGE
pattern='^(?!.*subfolder/).*$'
But we don't want to hardcode the folder name. When I just keep the '/' it doesn't work anymore (the same happens when I escape the slash as \/):
copy into TARGETTABLE
from @SOURCESTAGE
pattern='^(?!.*/).*$'
Does anybody know which regex to use to skip any subfolder in the COPY INTO in a dynamic way (no hardcoding of the folder name)?
@test_stage/folder_include
@test_stage/folder_include/file_that_has_to_be_loaded.csv
@test_stage/folder_include/folder_exclude/file_that_cannot_be_loaded.csv
So only files directly within folder_include should be picked up by the COPY INTO statement. Everything at a lower level needs to be skipped.
Most importantly: without hardcoding the folder name. Any folder within folder_include has to be ignored.
Thanks!
Here (as mentioned in the comments) is a solution for skipping a hardcoded folder name: How to avoid sub folders in snowflake copy statement
In my opinion, replacing the hardcoded part with .* makes it generic.
Kind regards :)
If the path that's included in the stage is static, you can include that in your pattern:
list @SOURCESTAGE PATTERN = 'full_path_to_folder_include/[^/]*'
Even if your path includes an environment-specific folder (e.g. DEV, PROD), you can account for that:
list @SOURCESTAGE PATTERN = 'static_path/[^/]+/path_to_folder/[^/]*'
or
list @SOURCESTAGE PATTERN = 'static_path/(dev|test|prod)/path_to_folder/[^/]*'
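The same pattern should carry over from list to the load itself; a sketch reusing the placeholder path from the examples above:
copy into TARGETTABLE
from @SOURCESTAGE
pattern='full_path_to_folder_include/[^/]*'
The [^/]* part matches only names with no further slash, which is what skips anything nested one level deeper.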

How does regex_path_filter work in GCSFile properties of DATA FUSION pipeline in GCP

In a Data Fusion pipeline on GCP, the GCSFile properties have a field named "Regex path filter". How does it work? I can't find proper documentation on this.
You can find the regex documentation here.
How does it work? It is applied to the filenames and not to the whole path.
For example, let's say you have the following path: gs://<my-bucket>/<my/complete/path>/ and you have some CSV and JSON files inside this path.
To filter only the CSV files you would use the regex .*\.csv
Please note that this regex will only filter what starts after your path.
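As a quick illustration of how such a filter regex behaves, here is a small Java snippet (a sketch assuming standard java.util.regex semantics; the file names are made up):
public class RegexFilterDemo {
    public static void main(String[] args) {
        // Only names ending in .csv match the filter regex from the answer above
        String regex = ".*\\.csv";
        System.out.println("report_2021.csv".matches(regex));   // true  -> file is read
        System.out.println("report_2021.json".matches(regex));  // false -> file is skipped
    }
}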

TextIO. Read multiple files from GCS using pattern {}

I tried using the following:
TextIO.Read.from("gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv")
That pattern didn't work, as I get
java.lang.IllegalStateException: Unable to find any files matching StaticValueProvider{value=gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv}
even though those two files do exist. I then tried with a local file using a similar expression:
TextIO.Read.from("somefolder/xxx_{2017-06-06,2017-06-06}.csv")
And that did work just fine.
I would've thought there would be support for all kinds of globs for files in GCS, but apparently not. Why is that? Is there a way to accomplish what I'm looking for?
This may be another option, in addition to Scott's suggestion and your comment on his answer:
You can define a list with the paths you want to read and then iterate over it, creating a number of PCollections in the usual way:
PCollection<String> events1 = p.apply(TextIO.Read.from(path1));
PCollection<String> events2 = p.apply(TextIO.Read.from(path2));
Then create a PCollectionList:
PCollectionList<String> eventsList = PCollectionList.of(events1).and(events2);
And then flatten this list into your PCollection for your main input:
PCollection<String> events = eventsList.apply(Flatten.pCollections());
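If the list of paths is long, the same idea can be written as a loop; a rough sketch, staying with the pre-2.0 TextIO.Read style used above (the paths and the Pipeline variable p are illustrative, and imports such as java.util.Arrays are omitted as in the snippets above):
List<String> paths = Arrays.asList(
        "gs://xyz.abc/xxx_2017-06-06.csv",
        "gs://xyz.abc/xxx_2017-06-07.csv");
PCollectionList<String> eventsList = null;
for (String path : paths) {
    // Read each path into its own PCollection and accumulate them in the list
    PCollection<String> part = p.apply(TextIO.Read.from(path));
    eventsList = (eventsList == null) ? PCollectionList.of(part) : eventsList.and(part);
}
PCollection<String> events = eventsList.apply(Flatten.pCollections());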
Glob patterns work slightly differently in Google Cloud Storage vs. the local filesystem. Apache Beam's TextIO.Read transform will defer to the underlying filesystem to interpret the glob.
GCS glob wildcard patterns are documented here (Wildcard Names).
In the case above, you could use:
TextIO.Read.from("gs://xyz.abc/xxx_2017-06-*.csv")
Note however that this will also include any other matching files.
Did you try the Apache Beam TextIO.Read.from function? Here, it says that it is possible with GCS as well:
public TextIO.Read from(java.lang.String filepattern)
Reads text files from the file(s) with the given filename or filename pattern. This can be a local path (if running locally), or a Google Cloud Storage filename or filename pattern of the form "gs://<bucket>/<filepath>" (if running locally or using remote execution service).
Standard Java Filesystem glob patterns ("*", "?", "[..]") are supported.

Copy multiple .gz files from one GCS bucket to another bucket in Java

I want to copy multiple .gz files from one GCS bucket to another. The file name pattern has the prefix 'Logs_' and a date suffix like '20160909', so the full file names are Logs_20160909.gz, Logs_20160908.gz, etc. I want to copy all files starting with Logs_ from one GCS bucket to another GCS bucket. For this I am using the wildcard character * at the end, like Logs_*.gz, in the copy operation as below:
Storage.Objects.Copy request =
    storageService
        .objects()
        .copy("source_bucket", "Logs_*.gz", "destination_bucket", ".", content);
Above I am using "." because all files have to be copied to destination_bucket, so I can't specify a single file name there. Unfortunately, this code doesn't work and returns an error that the file doesn't exist. I am not sure what change is required here. Any Java link or piece of code would be helpful. Thanks!
While the gsutil command-line utility happily supports wildcards, the GCS APIs themselves are lower-level operations and do not. The storage.objects.copy method must have one precise source and one precise destination.
I recommend one of the following:
Use a small script invoking gsutil, or
Make a storage.objects.list call to get the names of all matching source objects, then iterate over them, calling copy for each (see the sketch after this list), or
If you're dealing with more than, say, 10 TB or so of gzip files, consider using Google's Cloud Storage Transfer Service to copy the files.
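For option 2, a rough sketch using the same com.google.api.services.storage.Storage client ("storageService") as in the question; the bucket names are the question's placeholders, and pagination and null checks are left out for brevity:
// List every object whose name starts with "Logs_" (a server-side prefix filter stands in for the wildcard)
Objects listing = storageService.objects()
        .list("source_bucket")
        .setPrefix("Logs_")
        .execute();
// Copy each matching object into the destination bucket under the same name
for (StorageObject obj : listing.getItems()) {
    storageService.objects()
            .copy("source_bucket", obj.getName(), "destination_bucket", obj.getName(), obj)
            .execute();
}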