How to avoid copying data from subfolders using COPY INTO in Snowflake

We are trying to load data from an S3 bucket into Snowflake using COPY INTO.
The load itself works, but data in subfolders is also being copied, and this should not happen.
The following hardcoded regex pattern works perfectly:
copy into TARGETTABLE
from @SOURCESTAGE
pattern='^(?!.*subfolder/).*$'
But we don't want to hardcode the folder name. When I keep only the '/', it no longer works (the same happens when I escape the slash as \/):
copy into TARGETTABLE
from @SOURCESTAGE
pattern='^(?!.*/).*$'
Does anybody know which regex to use to skip any subfolder in COPY INTO in a dynamic way (no hardcoding of the folder name)?
@test_stage/folder_include
@test_stage/folder_include/file_that_has_to_be_loaded.csv
@test_stage/folder_include/folder_exclude/file_that_cannot_be_loaded.csv
So only files directly within folder_include should be picked up by the COPY INTO statement. Everything at a lower level needs to be skipped.
Most importantly: without hardcoding the folder name. Any folder within folder_include has to be ignored.
Thanks!

Here (as mentioned in the comments) is a solution for skipping a hardcoded folder name: "How to avoid sub folders in snowflake copy statement".
In my opinion, replacing the hardcoded part with .* makes it generic.
Kind regards :)

If the path included in the stage is static, you can include it in your pattern:
list @SOURCESTAGE PATTERN = 'full_path_to_folder_include/[^/]*'
Even if your path includes an environment-specific folder (e.g. DEV, PROD), you can account for that:
list @SOURCESTAGE PATTERN = 'static_path/[^/]+/path_to_folder/[^/]*'
or
list @SOURCESTAGE PATTERN = 'static_path/(dev|test|prod)/path_to_folder/[^/]*'
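To see why [^/]* stops at the folder boundary, here is a minimal Python sketch. It is not Snowflake code: it assumes the pattern has to match the whole relative path (modelled with re.fullmatch), and the helper name direct_children is made up for illustration. It also shows that, under the same assumption, the '^(?!.*/).*$' attempt from the question rejects every path that contains a slash, including the file that should be loaded.

```python
import re

def direct_children(paths, folder):
    """Return only the paths that sit directly inside `folder`."""
    # [^/]* matches a file name but cannot cross a '/' into a subfolder.
    pattern = re.compile(re.escape(folder) + r"/[^/]*")
    # fullmatch is an assumption: it models a pattern that must cover the whole path.
    return [p for p in paths if pattern.fullmatch(p)]

paths = [
    "folder_include/file_that_has_to_be_loaded.csv",
    "folder_include/folder_exclude/file_that_cannot_be_loaded.csv",
]

print(direct_children(paths, "folder_include"))
# -> ['folder_include/file_that_has_to_be_loaded.csv']

# The generic negative lookahead from the question rejects any path with a '/'.
no_slash = re.compile(r"^(?!.*/).*$")
print([p for p in paths if no_slash.fullmatch(p)])
# -> []
```

How Snowflake trims the stage location before applying PATTERN may differ from this model, so it is worth checking the result with LIST before running COPY INTO.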

Related

How to use wildcards in filename in AzureDataFactoryV2 to copy only specific files from a container?

I have a pipeline in AzureDataFactoryV2 in which I have defined a Copy activity to copy files from blob storage to Azure Data Lake Store. I want to copy all the files except those that have the string "-2192-" in their names.
So if I have these files:
213-1000-aasd.csv
343-2000-aasd.csv
343-3000-aasd.csv
213-2192-aasd.csv
I want to copy all of them using the Copy activity except 213-2192-aasd.csv. I have tried different regular expressions in the wildcard option, with no success.
As far as I know, the regular expression should be:
[^-2192-].csv
But this gives errors.
Thanks.

I don't know whether the Data Factory expression language supports regex. Assuming it does not, the wildcard is probably positive matching only, so using a wildcard to exclude specific patterns seems unlikely to work.
What you could do is use 1) Get Metadata to get the list of objects in the blob folder, then 2) a Filter where item().type is 'File' and the index of '-2192-' in the file name is < 0 [indexes are 0-based], and finally 3) a ForEach over the Filter output that contains the Copy activity.
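The filter condition in step 2 can be sketched in Python to make the logic concrete. This is only an illustration of the test, not Data Factory syntax; the function name files_to_copy, the dictionary layout of the listing, and the "archive" folder entry are assumptions made for the example.

```python
def files_to_copy(items, excluded="-2192-"):
    """Keep file entries whose name does not contain `excluded`."""
    return [
        item["name"]
        for item in items
        # str.find returns -1 when the substring is absent, which mirrors
        # the "index < 0" test described for the Filter step.
        if item["type"] == "File" and item["name"].find(excluded) < 0
    ]

# Hypothetical listing, shaped like the output of a metadata lookup.
listing = [
    {"name": "213-1000-aasd.csv", "type": "File"},
    {"name": "343-2000-aasd.csv", "type": "File"},
    {"name": "343-3000-aasd.csv", "type": "File"},
    {"name": "213-2192-aasd.csv", "type": "File"},
    {"name": "archive", "type": "Folder"},
]

print(files_to_copy(listing))
# -> ['213-1000-aasd.csv', '343-2000-aasd.csv', '343-3000-aasd.csv']
```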

Rename a specific component of file names using a list

I have a number of files that look like this:
imgDATA_subj001_log000_sess001_at.img
imgDATA_subj001_log000_sess001_cn.img
imgDATA_subj001_log000_sess001_cx.img
imgDATA_subj001_log000_sess002_at.img
imgDATA_subj001_log000_sess002_cn.img
imgDATA_subj001_log000_sess002_cx.img
imgDATA_subj002_log000_sess001_at.img
...
I want to rename a specific numeric part of the file name after subj. For instance, subj001, subj002, subj003, etc. would be renamed to subj014, subj027, subj65, etc., while preserving the rest of the file name. I have the list of new names, but I am not sure how to look up the old names, match them with the new names, and then do the renaming. I tried using loops and fileparts, but I don't know how to isolate the subj*** component. I could use move, but that would be very inefficient. Can anyone help?

If you know specifically which portions of the file name you want to replace, i.e. you know that subj001 should become subj014, use the dir command to get the list of files in the current directory and loop over it:
f = dir('imgDATA*.img');   % the pattern must be a quoted string
for k = 1:length(f)
    filename = f(k).name;
    % swap the old subject ID for the new one, keeping the rest of the name
    newname = strrep(filename, 'subj001', 'subj014');
    movefile(filename, newname);
end
Obviously you'll want to set up an array that pairs each old name with its new name and iterate through that.
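The "array of names to match up" idea can be sketched as a lookup table. The example below is in Python rather than MATLAB, purely as an illustration; the mapping uses the first two pairs mentioned in the question, and the helper name rename_subject is hypothetical. It only computes the new names; the actual rename would be a call such as os.rename(old, new) per file.

```python
import re

def rename_subject(filename, mapping):
    """Swap the subjNNN component using `mapping`; leave unmapped names as-is."""
    def swap(match):
        old = match.group(0)
        return mapping.get(old, old)
    # "subj" followed by digits isolates the subject component.
    return re.sub(r"subj\d+", swap, filename, count=1)

# Old ID -> new ID, taken from the pairs given in the question.
mapping = {"subj001": "subj014", "subj002": "subj027"}

names = [
    "imgDATA_subj001_log000_sess001_at.img",
    "imgDATA_subj002_log000_sess001_at.img",
]
for old in names:
    print(old, "->", rename_subject(old, mapping))
```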

Rake FileList exclude method works in irb but not my rakefile

Goal : Collect all files, complete with directory structure, matching a directory structure.
Wrinkle: Need to filter out a pesky undesired directory that matches but is thankfully uniquely named 'do-not-want'. Actual string changed to protect the innocent.
source/dir1/content/scripts - ok
source/dir2/subdir1/content/scripts - ok
source/dir3/do-not-want/content/scripts - well... do not want
The script below works but I have to do a separate check for the undesired path which should not be necessary. When I test this same FileList in irb with the exclude it works as desired. From my rakefile I see the do-not-want directories being returned by the FileList.
FileList['source/**/content/scripts'].exclude('do-not-want').each do |f|
  unless /do-not-want/ =~ f # hmm, why does the exclude above not actually exclude do-not-want directories?
    Dir.chdir(f) do |d|
      puts "directory changed to #{d} and copying scripts from #{d} to common directory #{target}"
      FileUtils.cp_r('.', target)
    end
  end
end
Surely I am doing something dumb.
Bonus points: if you help me learn rake/ruby and show me a better way to accomplish the same goal while defeating the wrinkle.

I think you should use FileList['source/**/content/scripts'].exclude(/\/do-not-want\//) to filter out paths that contain the "/do-not-want/" substring.
Try adding output to your Rakefile so you can see and debug what's going on there.
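One plausible reason the string version excludes nothing is that a string pattern is compared against the whole path, while a regex only needs to match somewhere inside it. That is an assumption about Rake's behavior, not something confirmed in this thread. The difference itself can be sketched in Python, with fnmatch standing in for glob-style matching; the helper names are made up for the example.

```python
import fnmatch
import re

paths = [
    "source/dir1/content/scripts",
    "source/dir2/subdir1/content/scripts",
    "source/dir3/do-not-want/content/scripts",
]

def exclude_glob(paths, pattern):
    # A glob without wildcards only matches a path equal to the pattern itself.
    return [p for p in paths if not fnmatch.fnmatchcase(p, pattern)]

def exclude_regex(paths, regex):
    # re.search succeeds if the regex matches anywhere in the path.
    return [p for p in paths if not re.search(regex, p)]

print(exclude_glob(paths, "do-not-want"))     # all three paths survive
print(exclude_regex(paths, r"/do-not-want/"))  # only the first two survive
```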

Delete multiple files with names containing a substring efficiently

I would like to delete multiple files whose names contain a substring. Say, for example, I would like to delete all the files that have the substring my. Assume that my directory contains 4 files: photo.jpg, myPhoto.jpg, beachMyPhoto.jpg, anyPhoto.jpg. Since the search term is my, the files that I want to delete are myPhoto.jpg and beachMyPhoto.jpg (case insensitive).
My proposed solution (which I know how to implement) is to use the NSFileManager class and its method contentsOfDirectoryAtPath:error: to read all the directory contents, and then loop over them searching for a hit. If a hit is found, I delete that file.
What I don't like about my proposed solution is that it is not very efficient, especially if the directory contains many files and the number of hits is small. Is there a more efficient way to do this?

If you don't want a big array loaded into memory, you can try -[NSFileManager enumeratorAtURL:includingPropertiesForKeys:options:errorHandler:]. Since you only want the immediate contents of the directory, you would invoke -[NSDirectoryEnumerator skipDescendants] for each directory that it returns.
If your concern is iterating over all of the items in the directory, testing for your match pattern, well that's unavoidable. Any technique you would hope to use has to somehow iterate over all of the items in the directory and test for a match. The only question is whether that iteration is exposed to you or not. In Cocoa, it is. You could drop down to the glob() function if you want an alternative where it isn't.
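The same trade-off can be sketched outside Cocoa. The Python example below is only an analogy, not the NSFileManager API: os.scandir yields directory entries lazily, so no full listing is held in memory, yet every entry is still visited and tested once. The helper name names_containing is hypothetical, and the files are created in a temporary directory just for the demonstration.

```python
import os
import tempfile

def names_containing(directory, needle):
    """Names of immediate files in `directory` that contain `needle`, ignoring case."""
    needle = needle.lower()
    with os.scandir(directory) as entries:  # lazy iterator over the directory
        return sorted(
            e.name for e in entries
            if e.is_file() and needle in e.name.lower()
        )

with tempfile.TemporaryDirectory() as tmp:
    # Recreate the four files from the question.
    for name in ("photo.jpg", "myPhoto.jpg", "beachMyPhoto.jpg", "anyPhoto.jpg"):
        open(os.path.join(tmp, name), "w").close()
    hits = names_containing(tmp, "my")
    for name in hits:
        os.remove(os.path.join(tmp, name))
    remaining = sorted(os.listdir(tmp))

print(hits)       # -> ['beachMyPhoto.jpg', 'myPhoto.jpg']
print(remaining)  # -> ['anyPhoto.jpg', 'photo.jpg']
```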

Difference in the paths in .gitignore file?

I've been using git but still having confusion about the .gitignore file paths.
So, what is the difference between the following two paths in .gitignore file?
tmp/*
public/documents/**/*
I can understand that tmp/* will ignore all the files and folders inside it. Am I right?
But what does that second line path mean?

This depends on the behavior of your shell. Git doesn't do any work to determine how to expand these. In general, * matches any single file or folder:
/a/*/z
matches /a/b/z
matches /a/c/z
doesn't match /a/b/c/z
** matches any string of folders:
/a/**/z
matches /a/b/z
matches /a/b/c/z
matches /a/b/c/d/e/f/g/h/i/z
doesn't match /a/b/c/z/d.pr0n
Combine ** with * to match files in an entire folder tree:
/a/**/z/*.pr0n
matches /a/b/c/z/d.pr0n
matches /a/b/z/foo.pr0n
doesn't match /a/b/z/bar.txt

Update (08-Mar-2016)
Today, I am unable to find a machine where ** does not work as claimed. That includes OSX-10.11.3 (El Capitan) and Ubuntu-14.04.1 (Trusty). Possibly git-ignore has been updated, or possibly recent fnmatch handles ** as people expect. So the accepted answer now seems to be correct in practice.
Original post
The ** has no special meaning in git. It is a feature of bash >= 4.0, via
shopt -s globstar
But git does not use bash. To see what git actually does, you can experiment with git add -nv and files in several levels of sub-directories.
For the OP, I've tried every combination I can think of for the .gitignore file, and nothing works any better than this:
public/documents/
The following does not do what everyone seems to think:
public/documents/**/*.obj
I cannot get that to work no matter what I try, but at least that is consistent with the git docs. I suspect that when people add that to .gitignore, it works by accident, only because their .obj files are precisely one sub-directory deep. They probably copied the double-asterisk from a bash script. But perhaps there are systems where fnmatch(3) can handle the double-asterisk as bash can.

If you're using a shell such as Bash 4, then ** is essentially a recursive version of *, which will match any number of subdirectories.
This makes more sense if you add a file extension to your examples. To match log files immediately inside tmp, you would type:
/tmp/*.log
To match log files anywhere in any subdirectory of tmp, you would type:
/tmp/**/*.log
But testing with git version 1.6.0.4 and bash version 3.2.17(1)-release, it appears that git does not support ** globs at all. The most recent man page for gitignore doesn't mention **, either, so this is either (1) very new, (2) unsupported, or (3) somehow dependent on your system's implementation of globbing.
Also, there's something subtle going on in your examples. This expression:
tmp/*
...actually means "ignore any file inside a tmp directory, anywhere in the source tree, but don't ignore the tmp directories themselves". Under normal circumstances, you'd probably just write:
/tmp
...which would ignore a single top-level tmp directory. If you do need to keep the tmp directories around, while ignoring their contents, you should place an empty .gitignore file in each tmp directory to make sure that git actually creates the directory.

Note that the '**', when combined with a sub-directory (**/bar), must have changed from its default behavior, since the release note for git1.8.2 now mentions:
The patterns in .gitignore and .gitattributes files can have **/, as a pattern that matches 0 or more levels of subdirectory.
E.g. "foo/**/bar" matches "bar" in "foo" itself or in a subdirectory of "foo".
See commit 4c251e5cb5c245ee3bb98c7cedbe944df93e45f4:
"foo/**/bar" matches "foo/x/bar", "foo/x/y/bar"... but not "foo/bar".
We make a special case, when foo/**/ is detected (and "foo/" part is already matched), try matching "bar" with the rest of the string.
"Match one or more directories" semantics can be easily achieved using "foo/*/**/bar".
This also makes "**/foo" match "foo" in addition to "x/foo", "x/y/foo"..
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Simon Buchan also commented:
current docs (.gitignore man page) are pretty clear that no subdirectory is needed, x/** matches all files under (possibly empty) x
The .gitignore man page does mention:
A trailing "/**" matches everything inside. For example, "abc/**" matches all files inside directory "abc", relative to the location of the .gitignore file, with infinite depth.
A slash followed by two consecutive asterisks then a slash matches zero or more directories. For example, "a/**/b" matches "a/b", "a/x/b", "a/x/y/b" and so on.
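The rules quoted above can be modelled with a small pattern-to-regex translation. The Python sketch below is a simplified illustration, not git's implementation: it covers only *, a leading "**/", "/**/", and a trailing "/**", and it ignores negation, anchoring, and git's handling of ignored directories. The function name gitignore_like_to_regex is hypothetical.

```python
import re

def gitignore_like_to_regex(pattern):
    """Translate a small subset of gitignore-style globs into a compiled regex."""
    out = []
    i, n = 0, len(pattern)
    while i < n:
        if pattern.startswith("/**/", i):
            out.append("/(?:.*/)?")      # zero or more directory levels
            i += 4
        elif i == 0 and pattern.startswith("**/"):
            out.append("(?:.*/)?")       # leading **/ : any directory prefix
            i += 3
        elif pattern.startswith("/**", i) and i + 3 == n:
            out.append("/.*")            # trailing /** : everything inside
            i += 3
        elif pattern[i] == "*":
            out.append("[^/]*")          # single * never crosses a slash
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))

for pat, path in [
    ("a/**/b", "a/b"),
    ("a/**/b", "a/x/y/b"),
    ("abc/**", "abc/deep/file.txt"),
    ("tmp/*", "tmp/sub/a.log"),
    ("foo/*/**/bar", "foo/bar"),
]:
    print(pat, path, bool(gitignore_like_to_regex(pat).fullmatch(path)))
```

The last pattern is the "one or more directories" form from the commit message: it matches foo/x/bar but not foo/bar.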

When ** isn't supported, the "/" is essentially a terminating character for the wildcard, so when you have something like:
public/documents/**/*
it is essentially looking for two wildcard items in between the slashes and does not pick up the slashes themselves. Consequently, this would be the same as:
public/documents/*/*

It doesn't work for me but you could create a new .gitignore in that subdirectory:
tmp/**/*.log
can be replaced by a .gitignore in tmp:
*.log
