Select files using include and exclude arrays of recursive glob patterns - PowerShell

I've been given two file glob parameters in JSON, include and exclude, in the following format:
{
  "include": ["**/*.md", "**/swagger/*.json", "**/*.yml", "somedir/*.yml"],
  "exclude": ["**/obj/**", "otherdir/**", "**/includes/**"]
}
I'm tasked with walking a directory tree and selecting files according to the include and exclude rules in this format; this has to be written as a PowerShell script.
I've been trying to find a built-in command that supports the double-asterisk, recursive file glob pattern; additionally, since PowerShell converts the JSON to an object, it would be nice if the command parameters could accept an array as input.
I've looked at Get-ChildItem, but I'm not sure I can mimic the glob resolution behavior using -Include, -Exclude, and/or -Filter. I've also looked at Resolve-Path, but I'm not sure the wildcards will work correctly (and I might have to exclude paths manually).
How can I select paths using multiple recursive wildcard file globs in PowerShell while excluding other globs? Is there a PowerShell command that supports this?
Thank you!
EDIT:
In these glob patterns, the single asterisk is a regular wildcard. The double asterisk (**), however, is a widely used convention that denotes a recursive directory search.
For example: the pattern dir1/*/file.txt would match:
dir1/dir2/file.txt
dir1/dir3/file.txt
...but not:
dir1/dir2/dir3/file.txt
The pattern dir1/**/file.txt would match everything that the above selector would, but it would also match:
dir1/dir3/dir4/file.txt
dir1/dir7/dir9/dir23/dir47/file.txt
and so on. So, an exclude glob pattern like **/obj/** basically means "exclude anything found in any obj folder found at any point in the directory hierarchy, no matter how deep".
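For illustration, here is a minimal sketch of one possible approach if nothing built-in fits: translate each glob into a regular expression and filter the output of Get-ChildItem -Recurse against root-relative paths. The function names Convert-GlobToRegex and Select-FilesByGlob are invented for this sketch, and the translation only covers the *, **/ and ** forms described above.

function Convert-GlobToRegex {
    param([string]$Glob)
    # Escape regex metacharacters first, then translate the glob tokens back.
    $pattern = [regex]::Escape($Glob)
    $pattern = $pattern -replace '\\\*\\\*/', '(?:.*/)?'   # **/  -> any number of directories (or none)
    $pattern = $pattern -replace '\\\*\\\*', '.*'          # **   -> anything, including /
    $pattern = $pattern -replace '\\\*', '[^/]*'           # *    -> anything except /
    return '^' + $pattern + '$'
}

function Select-FilesByGlob {
    param(
        [string]  $Root,
        [string[]]$Include,
        [string[]]$Exclude
    )
    $rootFull = (Resolve-Path -LiteralPath $Root).ProviderPath
    $inc = $Include | ForEach-Object { Convert-GlobToRegex $_ }
    $exc = $Exclude | ForEach-Object { Convert-GlobToRegex $_ }

    Get-ChildItem -Path $rootFull -Recurse -File | Where-Object {
        # Match against the path relative to the root, using forward slashes.
        $rel = $_.FullName.Substring($rootFull.Length).TrimStart('\', '/') -replace '\\', '/'
        $isIncluded = @($inc | Where-Object { $rel -match $_ }).Count -gt 0
        $isExcluded = @($exc | Where-Object { $rel -match $_ }).Count -gt 0
        $isIncluded -and -not $isExcluded
    }
}

# Example call with the arrays from the converted JSON object:
# Select-FilesByGlob -Root 'C:\repo' -Include $json.include -Exclude $json.exclude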

Related

Databricks PySpark environment, find Azure storage account file path of files having same filename pattern

Use Case: In a Databricks PySpark environment, I want to check whether there are multiple files with the same file name pattern in the Azure storage account. If they exist, I expect to get a list of the file path locations for each matched file.
I tried using dbutils.fs.ls, but it does not support wildcard patterns.
Workaround: get the paths of all files in the folder, then loop over each file to do filename pattern matching and prepare a list of the required file paths.
Is there any other way to get the file paths without looping?
In Databricks, dbutils.fs.ls() doesn’t support wildcard paths. The official documentation lists all the Databricks utilities, and there is no dbutils function that supports wildcard paths for matching file names.
You cannot avoid loops here. The following operations were done against a storage account with sample files for the demo; they show one way to get the files that match your pattern.
Using the os.listdir() function, you can get the list of all files in your container/directory.
import os

path_dbfs = "dbfs:/mnt/omega/"  # absolute dbfs path to your storage

# using os.listdir() to get all files in the container
path = "/dbfs/mnt/omega"
file_names = os.listdir(path)
print(file_names)
['country_data.csv', 'json_input.json', 'json_input.txt', 'person.csv', 'sample_1.csv', 'sample_2.csv', 'sample_3.csv', 'sample_new_date_4.csv', 'store.txt']
Once you have the list of all files, you can use regular expressions with re.search() and the match object's group() method to check whether each file matches the pattern.
import re

# use regex with a loop to get the absolute paths of the files matching the pattern
file_to_find_pattern = "sample.*csv"  # match pattern in this case
# .* indicates 0 or more occurrences of other characters; build the pattern according to your requirement
matched_files = []
for file in file_names:
    val = re.search(file_to_find_pattern, file)
    if val is not None:
        matched_files.append(path_dbfs + val.group())

print(matched_files)
['dbfs:/mnt/omega/sample_1.csv', 'dbfs:/mnt/omega/sample_2.csv', 'dbfs:/mnt/omega/sample_3.csv', 'dbfs:/mnt/omega/sample_new_date_4.csv']

PowerShell module-qualified names and Command Prefix

I'm trying to make use of module-qualified names[1] and a DefaultCommandPrefix and not have it break if the module is imported with Import-Module -Prefix SomethingElse. Maybe I'm just doing something really wrong here, or those two features aren't meant to be used together.
Inside the main module file, using "ModuleName\Verb-PrefixNoun args..." works as long as "Prefix" matches the DefaultCommandPrefix in the manifest (the module-qualified syntax seems to require the prefix used for the import[2]). But if the module is imported with a different prefix, all module-qualified references inside the module break.
After a bit of searching and trial and error, the least horrible solution I've managed to get working is the hack below. But I can't help wondering whether there is some better way that handles the prefix automatically (just as Import-Module obviously manages to add the prefix; my first naive thought was that plain ModuleName\Verb-Noun would automatically append any prefix to the noun, but evidently not[2]).
So this is the hack I came up with: it looks up the module's prefix and appends it, and then "." or "&" is used to expand/invoke the command:
# (imagine this code in the `ModuleName.psm1`, and a manifest with some `DefaultCommandPrefix`)
Function MQ {
    param (
        [Parameter()][string]
        $Verb,
        [Parameter()][string]
        $Noun,
        [string]
        $Module = 'ModuleName'
    )
    "$Module\$Verb-$((Get-Module $Module).Prefix)$Noun"
}

Function Verb-Noun {
    # This works even when importing with a prefix,
    # but can I be guaranteed that it's not some
    # other module's cmdlet?
    Verb-OtherNoun 1 2 3 '...'
    #ModuleName\Verb-OtherNoun 1 2 3 '...'

    . (MQ 'Verb' 'OtherNoun') 1 2 3 '...'
    # or:
    & (MQ 'Verb' 'OtherNoun') 1 2 3 '...'
}
(MQ could be made more user-friendly by also accepting a single string, e.g. MQ "Verb-Noun", and splitting/recombining automatically, as sketched below, and so on, etc., with all the usual disclaimers.)
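For instance, a single-string variant along those lines might look like this (just a sketch; MQ2 is a made-up name):

Function MQ2 {
    param (
        [Parameter(Mandatory)][string]
        $Command,                    # e.g. 'Verb-OtherNoun'
        [string]
        $Module = 'ModuleName'
    )
    # Split on the first hyphen, then rebuild the prefixed, module-qualified name.
    $verb, $noun = $Command -split '-', 2
    "$Module\$verb-$((Get-Module $Module).Prefix)$noun"
}

# & (MQ2 'Verb-OtherNoun') 1 2 3 '...'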
Note: I know it would be possible to hard-code the name instead of using DefaultCommandPrefix, e.g. as PSReadLine does (and a bunch of other modules). But to be honest, that feels like a workaround.
Just calling Verb-OtherNoun seems fragile to me, as the most recently loaded one is used[3]. So I would imagine that, for example, adding an Import-Module statement just before the call, for a module that exports a Verb-OtherNoun, would cause the wrong (not this module's) cmdlet to be called. (Perhaps a more real-world scenario is a module being loaded after this module has been loaded, but before Verb-Noun is called.)
Is there perhaps some syntax for module qualification I'm not aware of that would do something akin to Import-Module (e.g. Module\\Verb-Noun or Module\Verb+Noun that would resolve and inject Module's prefix)? And now that I think of it, is there some reason why Module\Verb-Noun doesn't handle prefixes, or is it just that no one has written the code for it? (I can't see how it would break things more than using DefaultCommandPrefix already breaks v2/v3[2].)
[1] https://www.sapien.com/blog/2015/10/23/using-module-qualified-cmdlet-names/
[2] https://github.com/PoshCode/PowerShellPracticeAndStyle/issues/23#issuecomment-106843619
[3] https://stackoverflow.com/a/22259706/13648152
You can avoid the problem by using neither a module qualifier nor a noun prefix when you call your module's own functions.
That is, call them as Verb-Noun, exactly as named in the target function's implementation.
This is generally safe, because your own module's functions take precedence over any commands of the same name defined outside your module.
The sole exception is if an alias defined in the global scope happens to have the same name as one of your functions - but that shouldn't normally be a concern, because aliases are used for short names that do not follow the verb-noun naming convention.
It also makes sense in that it allows you to call your functions module-internally with an invariable name - a name that doesn't situationally change, depending on what the caller decided to choose as a noun prefix via Import-Module -Prefix ....
Think of the prefix feature as being a caller-only feature that is unrelated to your module's implementation.
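A minimal sketch of what that looks like module-internally (the names and bodies here are placeholders):

# ModuleName.psm1 (sketch; function bodies are placeholders)
Function Verb-OtherNoun {
    param ($a, $b, $c, $d)
    "OtherNoun called with: $a $b $c $d"
}

Function Verb-Noun {
    # Plain, unqualified, unprefixed call: this resolves to the module's own
    # Verb-OtherNoun even if the caller did Import-Module -Prefix SomethingElse.
    Verb-OtherNoun 1 2 3 '...'
}

Export-ModuleMember -Function Verb-Noun, Verb-OtherNoun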
As an aside: As of PowerShell 7.0, declaring a default noun prefix via the DefaultCommandPrefix module-manifest property doesn't properly integrate with the module auto-loading and command-discovery features - see this GitHub issue.

Doxygen EXCLUDE_PATTERNS regex

I am attempting to exclude certain files from my doxygen generated documentation. I am using version 1.8.14.
My files come in this naming convention:
/Path2/OtherFile.cs
/Path/DAL.Entity/Source.cs
/Path/DAL.Entity/SourceBase.generated.cs
I want to exclude all files that do NOT end in Base.generated.cs, and are located inside of /Path/.
Since doxygen appears to claim that it uses regex for the EXCLUDE_PATTERNS variable, I eventually came up with this:
.*\\Path\\DAL\..{4,15}\\((?<!Base\.generated).)*
Needless to say, it did not work, nor did multiple other variations. So far a simple wildcard * is the only regex character I have gotten to actually work.
doxygen uses QRegExp for a lot of things, so I assumed that was the library used for this variable as well, but even several variations of patterns that that library claims to support did not work; granted, that library is apparently full of bugs, but I would expect some things to work.
Does doxygen actually use a regex library for this variable?
If so, which library is it?
In either case, is there a method of achieving my goal?
My conclusion is: no, the Doxygen Doxyfile does not support real regex, even though the documentation claims it does. Only standard wildcards work.
We ended up with a really awkward solution to work around this.
We added a macro in our CMakeLists.txt that builds a string with everything we want to include in INPUT instead, manually excluding the parts we don't want.
The sad part is that CMake's regex is also crippled, so we couldn't use advanced regex such as negative lookahead in LIST(FILTER ... EXCLUDE), similar to LIST(FILTER children EXCLUDE REGEX "^((?!autogen/public).)*$")... So even this solution is not really what we wanted.
Our CMakeLists.txt ended up looking something like this
cmake_minimum_required(VERSION 3.9)
project(documentation_html LANGUAGES CXX)
find_package(Doxygen REQUIRED dot)
# Custom macros
## Macro for getting all relevant directories when creating HTML documentation.
## This was created because the regex matching in Doxygen and CMake lacks support for more
## advanced syntax.
MACRO(SUBDIRS result current_dir include_regex)
    FILE(GLOB_RECURSE children ${current_dir} ${current_dir}/*)
    LIST(FILTER children INCLUDE REGEX "${include_regex}")
    SET(dir_list "")
    FOREACH(child ${children})
        get_filename_component(path ${child} DIRECTORY)
        # If we have the /source/build/autogen/public folder available we create the doxygen for those interfaces also.
        IF(${path} MATCHES ".*autogen/public.*$" OR NOT ${path} MATCHES ".*build.*$")
            LIST(APPEND dir_list ${path})
        ENDIF()
    ENDFOREACH()
    LIST(REMOVE_DUPLICATES dir_list)
    string(REPLACE ";" " " dirs "${dir_list}")
    SET(${result} ${dirs})
ENDMACRO()
SUBDIRS(DOCSDIRS "${CMAKE_SOURCE_DIR}/docs" ".*.plantuml$|.*.puml$|.*.md$|.*.txt$|.*.sty$|.*.tex$")
SUBDIRS(SOURCEDIRS "${CMAKE_SOURCE_DIR}/source" ".*.cpp$|.*.hpp$|.*.h$|.*.md$")
# Common config
set(DOXYGEN_CONFIG_PATH ${CMAKE_SOURCE_DIR}/docs/doxy_config)
set(DOXYGEN_IN ${DOXYGEN_CONFIG_PATH}/Doxyfile.in)
set(DOXYGEN_IMAGE_PATH ${CMAKE_SOURCE_DIR}/docs)
set(DOXYGEN_PLANTUML_INCLUDE_PATH ${CMAKE_SOURCE_DIR}/docs)
set(DOXYGEN_OUTPUT_DIRECTORY docs)
# HTML config
set(DOXYGEN_INPUT "${DOCSDIRS} ${SOURCEDIRS}")
set(DOXYGEN_EXCLUDE_PATTERNS "*/tests/* */.*/*")
set(DOXYGEN_FILE_PATTERNS "*.cpp *.hpp *.h *.md")
set(DOXYGEN_RECURSIVE NO)
set(DOXYGEN_GENERATE_LATEX NO)
set(DOXYGEN_GENERATE_HTML YES)
set(DOXYGEN_HTML_DYNAMIC_MENUS NO)
configure_file(${DOXYGEN_IN} ${CMAKE_BINARY_DIR}/DoxyHTML @ONLY)
add_custom_target(docs
    COMMAND ${DOXYGEN_EXECUTABLE} ${CMAKE_BINARY_DIR}/DoxyHTML -d Markdown
    WORKING_DIRECTORY ${CMAKE_BINARY_DIR}
    COMMENT "Generating documentation"
    VERBATIM)
and in the Doxyfile template we reference those CMake variables for the corresponding fields
OUTPUT_DIRECTORY = @DOXYGEN_OUTPUT_DIRECTORY@
INPUT = @DOXYGEN_INPUT@
FILE_PATTERNS = @DOXYGEN_FILE_PATTERNS@
RECURSIVE = @DOXYGEN_RECURSIVE@
EXCLUDE_PATTERNS = @DOXYGEN_EXCLUDE_PATTERNS@
IMAGE_PATH = @DOXYGEN_IMAGE_PATH@
GENERATE_HTML = @DOXYGEN_GENERATE_HTML@
HTML_DYNAMIC_MENUS = @DOXYGEN_HTML_DYNAMIC_MENUS@
GENERATE_LATEX = @DOXYGEN_GENERATE_LATEX@
PLANTUML_INCLUDE_PATH = @DOXYGEN_PLANTUML_INCLUDE_PATH@
After this we can run cd ./build && cmake ../ && make docs to create our html documentation and have it include the autogenerated interfaces in our source folder without including all the other directories in the build folder.
Quick description of what actually happens in the CMakeLists.txt
# Macro that gets all directories from current_dir recursively and returns the result to result as a space separated string
MACRO(SUBDIRS result current_dir include_regex)
    # Gets all files recursively from current_dir
    FILE(GLOB_RECURSE children ${current_dir} ${current_dir}/*)
    # Filter files so we only keep the files that match the include_regex (can't be too advanced a regex)
    LIST(FILTER children INCLUDE REGEX "${include_regex}")
    SET(dir_list "")
    # Let us act on all files... :)
    FOREACH(child ${children})
        # We're only interested in the path, so we get the path part from the file
        get_filename_component(path ${child} DIRECTORY)
        # Since CMake's regex is also crippled we can't do nice things such as
        # LIST(FILTER children EXCLUDE REGEX "^((?!autogen/public).)*$"), which would have been preferred
        # (CMake regex does not understand negative lookahead/lookbehind)... So we ended up with this ugly
        # thing instead: adding all build/autogen/public paths and not adding any other paths inside build.
        # I guess it would be possible to write this expression in regex without negative lookahead, but I'm
        # both not really fluent in regex (who is... right?) and a bit lazy in this case. We just needed to
        # get this one-pointer task done... :P
        IF(${path} MATCHES ".*autogen/public.*$" OR NOT ${path} MATCHES ".*build.*$")
            LIST(APPEND dir_list ${path})
        ENDIF()
    ENDFOREACH()
    # Remove all duplicates... Since we GLOBed all files there are a lot of them, so this is important
    # or the Doxygen INPUT will overflow... I know... I tested...
    LIST(REMOVE_DUPLICATES dir_list)
    # Convert the dir_list to a space separated string
    string(REPLACE ";" " " dirs "${dir_list}")
    # Return the result! Coffee and cinnamon buns for everyone!
    SET(${result} ${dirs})
ENDMACRO()
# Get all the paths that we want to include in our documentation... this is also where the build folders for the different applications are going to be... with our autogenerated interfaces, which we want to keep.
SUBDIRS(SOURCEDIRS "${CMAKE_SOURCE_DIR}/source" ".*.cpp$|.*.hpp$|.*.h$|.*.md$")
# Add the dirs we want to the Doxygen INPUT
set(DOXYGEN_INPUT "${SOURCEDIRS}")
# Normal exclude patterns for stuff we don't want to add. This thing does not support regex... even though it should.
set(DOXYGEN_EXCLUDE_PATTERNS "*/tests/* */.*/*")
# Normal use of the file patterns that we want to keep in the documentation
set(DOXYGEN_FILE_PATTERNS "*.cpp *.hpp *.h *.md")
# IMPORTANT! Since we are creating all the INPUT paths ourselves, we don't want Doxygen to do any recursion for us
set(DOXYGEN_RECURSIVE NO)
# Write the config
configure_file(${DOXYGEN_IN} ${CMAKE_BINARY_DIR}/DoxyHTML @ONLY)
# Create the target that will use that config to create the html documentation
add_custom_target(docs
    COMMAND ${DOXYGEN_EXECUTABLE} ${CMAKE_BINARY_DIR}/DoxyHTML -d Markdown
    WORKING_DIRECTORY ${CMAKE_BINARY_DIR}
    COMMENT "Generating documentation"
    VERBATIM)
I know this isn't the answer anyone who stumbles in on this question wants... unfortunately it seems to be the only reasonable solution...
... you all have my deepest condolences...

OSX find a file with (1) in its name

I have tried lots of variants of find and I can't seem to figure out which one to use to find files with names like
product (1).php
Parentheses have no special meaning in the filename matching pattern used by find, so you can just use:
find name_of_folder -type f -name '*(1)*'
Use quotes as usual to protect the asterisks from being expanded by the shell.

perl quoting in ftp->ls with wildcard

contents of remote directory mydir :
blah.myname.1.txt
blah.myname.somethingelse.txt
blah.myname.randomcharacters.txt
blah.notmyname.1.txt
blah.notmyname.2.txt
...
In Perl, I want to download all of the files with myname.
I am failing really hard with the appropriate quoting. Please help.
Failed code:
my @files;
@files = $ftp->ls( '*.myname.*.txt' );  # finds nothing
@files = $ftp->ls( '.*.myname.*.txt' ); # finds nothing
etc.
How do I write the wildcards so that they are interpreted by the ls, but not by Perl? What is going wrong here?
I will assume that you are using the Net::FTP package. Then this part of the docs is interesting:
ls ( [ DIR ] )
Get a directory listing of DIR, or the current directory.
In an array context, returns a list of lines returned from the server. In a scalar context, returns a reference to a list.
This means that if you call this method with no arguments, you get a list of all files from the current directory, else from the directory specified.
There is no word about any patterns, which is not surprising: FTP is just a protocol to transfer files, and this module is only a wrapper around that protocol.
You can do the filtering easily with grep:
my @interesting = grep /pattern/, $ftp->ls();
To select all files that contain the character sequence myname, use grep /myname/, LIST.
To select all files that contain the character sequence .myname., use grep /\.myname\./, LIST.
To select all files that end with the character sequence .txt, use grep /\.txt$/, LIST.
The LIST is either the $ftp->ls or another grep, so you can easily chain multiple filtering steps.
Of course, Perl Regexes are more powerful than that, and we could do all the filtering in a single /\.myname\.[^.]+\.txt$/ or something, depending on your exact requirements. If you are desperate for a globbing syntax, there are tools available to convert glob patterns to regex objects, like Text::Glob, or even to do direct glob matching:
use Text::Glob qw(match_glob);
my @interesting = match_glob ".*.myname.*.txt", $ftp->ls;
However, that is inelegant, to say the least, as regexes are far more powerful and absolutely worth learning.