I have the following code that works for most files. The input file (FoundLinks.csv) is a UTF-8 file with one file path per line. It contains the full paths of files on a particular drive that I need to process.
$inFiles = @()
$inFiles += @(Get-Content -Path "C:\Users\sw_admin\FoundLinks.csv")
foreach ($inFile in $inFiles) {
    Write-Host ("Processing: " + $inFile)
    $objFile = Get-ChildItem -LiteralPath $inFile
    New-Object PSObject -Prop @{
        FullName   = $objFile.FullName
        ModifyTime = $objFile.LastWriteTime
    }
}
But even though I've used -LiteralPath, it still fails to process files that have a non-breaking space in the file name.
Processing: q:\Executive\CLC\Budget\Co 2018 Budget - TO Bob (GA Prophix).xlsx
Get-ChildItem : Cannot find path 'Q:\Executive\CLC\Budget\Co 2018 Budget - TO Bob (GA Prophix).xlsx'
because it does not exist.
At ListFilesWithModifyTime.ps1:6 char:29
+ $objFile = Get-ChildItem <<<< -LiteralPath $inFile
+ CategoryInfo : ObjectNotFound: (Q:\Executive\CL...A Prophix).xlsx:String) [Get-ChildItem], ItemNotFoundException
+ FullyQualifiedErrorId : PathNotFound,Microsoft.PowerShell.Commands.GetChildItemCommand
I know my input file has the non-breaking space in the path because I'm able to open it in Notepad, copy the offending path, paste into Word, and turn on paragraph marks. It shows a normal space followed by a NBSP just before 2018.
Is PowerShell not reading in the NBSP? Am I passing it wrong to -LiteralPath? I'm at my wit's end. I saw this solution, but in that case they are supplying the path as a literal in the script, so I can't see how I could use that approach.
I've also tried: -Encoding UTF8 parameter on Get-Content, but no difference.
I'm not even sure how I can check $inFile in the code just to confirm if it still contains the NBSP.
Grateful for any help to get unstuck!
Confirmed that $inFile has NBSP
Thank you all! As per @TheMadTechnician, I have updated the code like this, and also reduced my input file to only the one file having a problem.
$inFiles = @()
$inFiles += @(Get-Content -Path "C:\Users\sw_admin\FoundLinks.csv" -Encoding UTF8)
foreach ($inFile in $inFiles) {
    Write-Host ("Processing: " + $inFile)
    # list out all chars to confirm it has an NBSP
    $inFile.ToCharArray() | % { "{0} -> {1}" -f $_, [int]$_ }
    $objFile = Get-ChildItem -LiteralPath $inFile
    New-Object PSObject -Prop @{
        FullName   = $objFile.FullName
        ModifyTime = $objFile.LastWriteTime
    }
}
And so now I can confirm that $inFile in fact still contains the NBSP just as it gets passed to Get-ChildItem. Yet Get-ChildItem says the file does not exist.
More I've tried:
Same if I use Get-Item instead of Get-ChildItem
Same if I use -Path instead of -LiteralPath
Windows Explorer and Excel can deal with the file successfully.
I'm on a Windows 7 machine, PowerShell 2.
Thanks again for all the responses!
It's still unclear why Sandra's code didn't work: PowerShell v2+ is capable of retrieving files with paths containing non-ASCII characters; perhaps a non-NTFS filesystem with different character encoding was involved?
However, the following workaround turned out to be effective:
$objFile = Get-ChildItem -Path ($inFile -replace ([char] 0xa0), '?')
The idea is to replace the non-breaking space char. (Unicode U+00A0; hex. 0xA0) in the input file path with wildcard character ?, which represents any single char.
For Get-ChildItem to perform wildcard matching, -Path rather than -LiteralPath must be used (note that -Path is actually the default if you pass a path argument positionally, as the first argument).
Hypothetically, the wildcard-based paths could match multiple files; if that were the case, the individual matches would have to be examined to identify the specific match that has a non-breaking space in the position of the ?.
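If that situation ever came up, one way to disambiguate (an untested sketch; it reuses $inFile from the loop above) is to keep only the candidates that differ from the original path solely at the positions where the original had an NBSP:

$pattern = $inFile -replace ([char] 0xa0), '?'
$candidates = @(Get-ChildItem -Path $pattern)
$objFile = $candidates | Where-Object {
    # Compare char by char, allowing any char where the original had U+00A0.
    if ($_.FullName.Length -ne $inFile.Length) { return $false }
    for ($i = 0; $i -lt $inFile.Length; $i++) {
        if ($inFile[$i] -ne ([char] 0xa0) -and $inFile[$i] -ne $_.FullName[$i]) { return $false }
    }
    $true
}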
Get-ChildItem is for listing children, so you would normally give it a directory; but it seems you are giving it a file, so when it says it cannot find the path, it's because it can't find a directory with that name.
Instead, you would want to use Get-Item -LiteralPath to get each individual item (this would be the same items you would get if you ran Get-ChildItem on its parent).
I think swapping in Get-Item would make your code work as is.
After testing, I think the above is in fact false, so sorry for that, but I will leave the below in case it's helpful, even though it may not solve your immediate problem.
But let's take a look at how it can be simplified with the pipeline.
First, you're starting with an empty array, then calling a command (Get-Content) which likely already returns an array, wrapping that in an array, then concatenating it to the empty one.
You could just do:
$inFiles = Get-Content -Path "C:\Users\sw_admin\FoundLinks.csv"
Yes, there is a chance that $inFiles will contain only a single item and not an array at all.
But the nice thing is that foreach won't mind one bit!
You can do something like this and it just works:
foreach ($string in "a literal single string") {
    Write-Host $string
}
But Get-Item (and Get-ChildItem for that matter) accept pipeline input, so they accept multiple items.
That means you could do this:
$inFiles = Get-Content -Path "C:\Users\sw_admin\FoundLinks.csv" | Get-Item
foreach ($inFile in $inFiles) {
    Write-Host ("Processing: " + $inFile)
    New-Object PSObject -Prop @{
        FullName   = $inFile.FullName
        ModifyTime = $inFile.LastWriteTime
    }
}
But even more than that, there is a pipeline-aware cmdlet for processing items, called ForEach-Object, to which you pass a [ScriptBlock], in which $_ represents the current item, so we could do it like this:
Get-Content -Path "C:\Users\sw_admin\FoundLinks.csv" |
    Get-Item |
    ForEach-Object -Process {
        Write-Host ("Processing: " + $_)
        New-Object PSObject -Prop @{
            FullName   = $_.FullName
            ModifyTime = $_.LastWriteTime
        }
    }
All in one pipeline!
But further, you're creating a new object with the 2 properties you want.
PowerShell has a nifty cmdlet called Select-Object which takes an input object and returns a new object containing only the properties you want; this would make for a cleaner syntax:
Get-Content -Path "C:\Users\sw_admin\FoundLinks.csv" |
    Get-Item |
    Select-Object -Property FullName, LastWriteTime
This is the power of the pipeline: passing real objects from one command to another.
I realize this last example does not write the processing message to the screen, however you could re-add that in if you wanted:
Get-Content -Path "C:\Users\sw_admin\FoundLinks.csv" |
    Get-Item |
    ForEach-Object -Process {
        Write-Host ("Processing: " + $_)
        $_ | Select-Object -Property FullName, LastWriteTime
    }
But you might also consider that many cmdlets support verbose output and try to just add -Verbose to some of your existing cmdlets. Sadly, it won't really help in this case.
One final note: when you pipe the output of one filesystem cmdlet into another, the receiving cmdlet binds the input object's PSPath property to -LiteralPath rather than -Path, so your special characters are still safe; be aware, though, that plain strings piped in (as from Get-Content) bind to -Path instead.
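If you want to see this binding for yourself, Trace-Command can show it as it happens (a diagnostic sketch; the path is just an example):

# Trace how pipeline input binds: a file/directory object binds to
# -LiteralPath via its PSPath property, while a plain string binds to -Path.
Trace-Command -Name ParameterBinding -PSHost -Expression {
    Get-Item C:\Windows | Get-Item
}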
I just ran into the same issue. It looks like Get-ChildItem (aka gci) expects the path in Unicode (UTF-16). So either convert the CSV file to Unicode, or convert the lines that include the path to Unicode within your script.
Tested on PS 5.1.22621.608
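For example, re-encoding the input file as UTF-16 ("Unicode" in PowerShell's encoding names) could look like this (an untested sketch; the file name is taken from the question, the output name is a placeholder):

# Read the UTF-8 input file and rewrite it as UTF-16 LE ("Unicode").
Get-Content -Path "C:\Users\sw_admin\FoundLinks.csv" -Encoding UTF8 |
    Set-Content -Path "C:\Users\sw_admin\FoundLinks-unicode.csv" -Encoding Unicode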
I've tried solving the following case:
many small text files (in subfolders) need their content (lines) matched to lines that exist in another (large) text file. The small files then need to be updated or copied with those matching lines.
I was able to come up with some running code for this but I need to improve it or use a complete other method because it is extremely slow and would take >40h to get through all files.
One idea I already had was to use a SQL Server to bulk-import all files in a single table with [relative path],[filename],[jap content] and the translation file in a table with [jap content],[eng content] and then join [jap content] and bulk-export the joined table as separate files using [relative path],[filename]. Unfortunately I got stuck right at the beginning due to formatting and encoding issues so I dropped it and started working on a PowerShell script.
Now in detail:
Over 40k txt files spread across multiple subfolders with multiple lines each, every line can exist in multiple files.
Content:
UTF-8 encoded Japanese text that can also contain special characters like \\[*+(), each line ending with a tab character. They sound like CSV files, but they don't have headers.
One large file with >600k lines containing the translations for the small files. Every line is unique within this file.
Content:
Again UTF-8 encoded Japanese text. Each line is formatted like this (without the brackets):
[Japanese Text][tab][English Text]
Example:
テスト[1] Test [1]
The end result should be a copy or an updated version of all these small files, with their lines replaced by the matching lines of the translation file, while maintaining their relative paths.
What I have at the moment:
$translationfile = 'B:\Translation.txt'
$inputpath = 'B:\Working'
$translationarray = [System.Collections.ArrayList]@()
$translationarray = @(Get-Content $translationfile -Encoding UTF8)
Get-ChildItem -Path $inputpath -Recurse -File -Filter *.txt | ForEach-Object -Parallel {
    $_.Name
    $filepath = ($_.Directory.FullName).Substring(2)
    $filearray = [System.Collections.ArrayList]@()
    $filearray = @(Get-Content -Path $_.FullName -Encoding UTF8)
    $filearray = $filearray | ForEach-Object {
        $result = $using:translationarray -match ("^$_" -replace '[[+*?()\\.]', '\$&')
        if ($result) {
            $_ = $result
        }
        $_
    }
    If (!(Test-Path B:\output\$filepath)) { New-Item -ItemType Directory -Force -Path B:\output\$filepath }
    #$("B:\output\" + $filepath + "\")
    $filearray | Out-File -FilePath $("B:\output\" + $filepath + "\" + $_.Name) -Force -Encoding UTF8
} -ThrottleLimit 10
I would appreciate any help and ideas, but please keep in mind that I rarely write scripts, so anything too complex might fly right over my head.
Thanks
As zett42 states, using a hash table is your best option for mapping the Japanese-only phrases to the dual-language lines.
Additionally, use of .NET APIs for file I/O can speed up the operation noticeably.
# Be sure to specify all paths as full paths, not least because .NET's
# current directory usually differs from PowerShell's
$translationfile = 'B:\Translation.txt'
$inPath = 'B:\Working'
$outPath = (New-Item -Type Directory -Force 'B:\Output').FullName
# Build the hashtable mapping the Japanese phrases to the full lines.
# Note that ReadLines() defaults to UTF-8
$ht = @{ }
foreach ($line in [IO.File]::ReadLines($translationfile)) {
    $ht[$line.Split("`t")[0] + "`t"] = $line
}
Get-ChildItem $inPath -Recurse -File -Filter *.txt | ForEach-Object -Parallel {
    # Translate the lines to the matching lines including the $translation
    # via the hashtable.
    # NOTE: If an input line isn't represented as a key in the hashtable,
    # it is passed through as-is.
    $lines = foreach ($line in [IO.File]::ReadLines($_.FullName)) {
        ($using:ht)[$line] ?? $line
    }
    # Synthesize the output file path, ensuring that the target dir. exists.
    $outFilePath = (New-Item -Force -Type Directory ($using:outPath + $_.Directory.FullName.Substring(($using:inPath).Length))).FullName + '/' + $_.Name
    # Write to the output file.
    # Note: If you want UTF-8 files *with BOM*, use -Encoding utf8bom
    Set-Content -Encoding utf8 $outFilePath -Value $lines
} -ThrottleLimit 10
Note: Your use of ForEach-Object -Parallel implies that you're using PowerShell [Core] 7+, where BOM-less UTF-8 is the consistent default encoding (unlike in Windows PowerShell, where default encodings vary wildly).
Therefore, in lieu of the .NET [IO.File]::ReadLines() API in a foreach loop, you could also use the more PowerShell-idiomatic switch statement with the -File parameter for efficient line-by-line text-file processing.
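A minimal sketch of that switch-based variant, meant to replace the foreach loop inside the ForEach-Object -Parallel block above (same hashtable, same null-coalescing fallback):

# Process the file line by line; in the default branch, $_ is the current line.
$lines = switch -File $_.FullName {
    default { ($using:ht)[$_] ?? $_ }
}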
Can anyone help me with a code puzzle in PowerShell? I'm trying to look at a specific directory on several remote servers, find the deepest nested subfolder in that directory, and then count the number of parent folders. Pseudo code below.
$servers = get-content (list of servers) and $path = (targetdir on remote machine)
for each $s in $servers:
find the longest path
count the # of \ (to identify # of subfolders)
Write output to file $Servername $countOfNestedFolders
Sorry I'm just good enough w/ posh to be a little dangerous.
Since you're trying to find the biggest count, it sounds like you'll want to do a comparative. Basically, start with a size of 0 - if the folder you're looking at is bigger than that, then it becomes the biggest. You do this for all the folders until you're left with the biggest folder. Note, this method won't work if there are any ties, but it doesn't sound like that's what you're looking for. I should add this is the main code for looking at a single computer. You can wrap a foreach ($server in $servers) around this for multiple servers.
$folders = Get-ChildItem -Path "C:\Directory" -Directory -Recurse
$n = 0
$biggest = ""
foreach ($folder in $folders)
{
    $splitout = $folder.FullName.split("\")
    if ($splitout.count -gt $n)
    {
        $n = $splitout.count
        $biggest = $folder
    }
}
Write-Host "Count $n - $biggest"
here's a slight variant of the "count the path parts" solutions. [grin] it counts the delimiters. if your paths are UNC paths OR local paths, this will still give you the deepest nested dir.
however, it will not work with mixed UNC [\\SysName\ShareName] and local [c:\] paths.
also, it does not remove the starting dir from the result.
also also, i am unsure how you want to count the number of parent folders, so i just posted the delimiter count.
what it does ...
sets the top dir to work from
gets the dir delimiter char
creates a regex escaped version of that char
grabs all the dirs in the target dir tree
sorts [in descending order] them by the string length of what is left over when you remove everything except the dir delimiters
grabs the 1st of those dirs
displays the .FullName of that dir
displays the number of dir delimiters in the above string
the code ...
$TargetTopDir = $env:APPDATA
$DirDelim = [System.IO.Path]::DirectorySeparatorChar
$RegexDD = [regex]::Escape($DirDelim)
$DirList = Get-ChildItem -LiteralPath $TargetTopDir -Directory -Recurse
$DeepestNestedDir = ($DirList |
Sort-Object {$_.FullName -replace "[^$RegexDD]"} -Descending)[0]
$DeepestNestedDir.FullName
'DirDelimCount = {0}' -f ($DeepestNestedDir.FullName -replace "[^$RegexDD]").Length
output ...
C:\Users\MyUserName\AppData\Roaming\Thunderbird\Profiles\shkjhmpc.default\extensions\{e2fda1a4-762b-4020-b5ad-a41df1933103}\chrome\calendar-gd\locale\gd\calendar\dialogs
DirDelimCount = 15
This got it done; thanks again for all the help!
$servers = gc C:\serverlist.txt
ForEach ($server in $servers) {
    $folder = "\\$server\x$\share"
    $TargetTopDir = $folder
    $DirDelim = [System.IO.Path]::DirectorySeparatorChar
    $RegexDD = [regex]::Escape($DirDelim)
    $DirList = Get-ChildItem -LiteralPath $TargetTopDir -Directory -Recurse -ErrorAction SilentlyContinue
    $DeepestNestedDir = ($DirList | Sort-Object {$_.FullName -replace "[^$RegexDD]"} -Descending)[0]
    $DepthCount = '{0}' -f ($DeepestNestedDir.FullName -replace "[^$RegexDD]").Length
    $arrayItems = @{
        "Depth Count" = $DepthCount - 3
        "Path Name"   = $DeepestNestedDir.FullName
        "Server Name" = $server
    }
    $output = @()
    $output += New-Object -TypeName PSObject -Property $arrayItems
    $output | Export-CSV C:\Output.csv -NoTypeInformation -Append
}
To solve your core problem:
For a given $path, you can find the maximum directory depth in its subtree - expressed as the number of path separators (\ on Windows, / on Unix) plus one in the full path of the most deeply nested subdirectories inside $path - as follows:
# Outputs the number of path components of the most deeply nested folder in $path.
(Get-ChildItem $path -Recurse -Directory |
Measure-Object -Maximum { ($_.FullName -split '[\\/]').Count }
).Maximum
Note: If you wanted to know the relative depth - relative to $path, add -Name to the Get-ChildItem call and replace $_.FullName with $_ inside the script block ({ ... }) passed to Measure-Object. A result of 0 then means that $path has no subdirectories at all, 1 means that there are only immediate subdirectories, 2 means that the immediate subdirectories have (only) subdirectories themselves, ...
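A sketch of that relative-depth variant, with the two changes applied to the command above:

# Outputs the depth of the most deeply nested folder relative to $path.
(Get-ChildItem $path -Recurse -Directory -Name |
  Measure-Object -Maximum { ($_ -split '[\\/]').Count }
).Maximum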
Get-ChildItem -Recurse -Directory $path outputs all subdirectories (-Directory) in the entire subtree (-Recurse) of directory $path; add -Force to include hidden subdirs. - see Get-ChildItem.
Measure-Object -Maximum { ($_.FullName -split '[\\/]').Count } calculates the count of path separators ([\\/] is a regex that matches both a single \ and / char.) in each directory's full path ($_.FullName) - using a script block {...} as the (implied) -Property argument inside of which $_ represents the input path at hand - and determines the maximum (-Maximum); given that Measure-Object outputs a Microsoft.PowerShell.Commands.GenericMeasureInfo instance, the raw maximum value is accessed via the .Maximum property.
All incidental tasks - applying this calculation to multiple servers, writing the results to server-specific files - can be accomplished with the usual cmdlets (Get-Content, ForEach-Object, Set-Content or Out-File / >).
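For instance, a wrapper along these lines could produce one result per server (an untested sketch; the server-list path, remote path, and output file are placeholders):

$results = foreach ($server in Get-Content C:\serverlist.txt) {
    $path = "\\$server\targetdir"   # placeholder remote path
    $depth = (Get-ChildItem $path -Recurse -Directory -ErrorAction SilentlyContinue |
        Measure-Object -Maximum { ($_.FullName -split '[\\/]').Count }).Maximum
    "$server $depth"
}
$results | Set-Content C:\DepthReport.txt   # placeholder output file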
A faster alternative:
The above command is concise and PowerShell-idiomatic, but somewhat slow.
Here's a significantly faster alternative that uses LINQ and .NET APIs directly:
# Note: Make sure that $path is a *full* path, because .NET's current
# directory usually differs from PowerShell's.
1 + [Linq.Enumerable]::Max(
([System.IO.Directory]::GetDirectories(
$path, '*', 'AllDirectories'
) -replace '[^\\/]').ForEach('Length')
)
Note: The above invariably includes hidden directories too. In .NET Core / .NET 5+, [System.IO.Directory]::GetDirectories() now provides an additional overload that provides more control over the enumeration.
Listing the maximum-depth directories too:
If you want not just to calculate the maximum depth, but also want to list all directories that have the maximum depth (note that there can be more than one):
# Sample input path.
# Note: Make sure that $path is a *full* path, because .NET's current
# directory usually differs from PowerShell's.
$path = $PWD
# Extract all directories with the max. depth using Group-Object:
# Group by the calculated depth and extract the last group, which relies on
# Group-Object outputting the results sorted by grouping criteria.
$maxDepthGroup =
    [System.IO.Directory]::GetDirectories($path, '*', 'AllDirectories') |
    Group-Object { ($_ -split '[\\/]').Count } |
    Select-Object -Last 1
# Construct the output object.
[pscustomobject] @{
    MaxDepth     = $maxDepthGroup.Values[0] # The grouping criterion, i.e. the depth.
    MaxDepthDirs = $maxDepthGroup.Group     # The paths comprising the group.
}
The output is a custom object with .MaxDepth and .MaxDepthDirs (an array of the full paths of those dirs. that have the max. depth) properties. If you pipe it to Format-List, you'll get something like:
MaxDepth : 6
MaxDepthDirs : {/Users/jdoe/Documents/Ram Dass Audio Collection/The Path of Service, /Users/jdoe/Documents/Ram Dass Audio Collection/Conscious Aging,
/Users/jdoe/Documents/Ram Dass Audio Collection/Cultivating the Heart of Compassion, /Users/jdoe/Documents/Cheatsheets/YAML Ain't
Markup Language_files}
I'm trying to gather a report of long directory paths to provide to each user who has them, so that they can use it to make those folder paths shorter.
How can I replace \\server\Share$ with X:? I tried the below, but nothing changes. I can only get results if I use just one character or one string like "\\server", but not the combination "\\server\Share$". Can someone tell me what I'm doing wrong?
$results= "\\\server\Share$\super\long\directory\path\"
$usershare="\\\server\Share$"
$Results | ForEach-Object { $_.FullName = $_.FullName -replace "$usershare", 'X:' }
The output I need is the following, which is what the users will see on their systems:
X:\super\long\directory\path\
Because the $userShare variable contains characters that have special meaning in Regular Expressions (and -replace uses Regex), you need to [Regex]::Escape() that string.
First thing to notice is that you start the UNC paths with three backslashes, where you should only have two.
Next is that your $results variable is simply declared as a string, while it should probably be the result of a Get-ChildItem command.
I guess what you want to do is something like this:
$uncPath = "\\server\Share$\super\long\directory\path\"   # the UNC folder path
$usershare = "\\server\Share$"
$results = Get-ChildItem -Path $uncPath | ForEach-Object {
    $_.FullName -replace ([regex]::Escape($usershare)), 'X:'
}
Hope that helps
I am brand new to scripting (or coding of any sort). I had an issue where I wanted to generate csv files to catalog directories and certain file names to aid in my work. I was able to put something together that works for what I need. With one exception, long names return the following error:
ERROR: The specified path, file name, or both are too long. The fully qualified file name must be less than 260 characters, and the directory name must be less than 248 characters.
Here is my script:
Write-Host "Andy's File Lister v2.2"
$drive = Read-Host "R or Q?"
$client = Read-Host "What is the client's name as it appears on the R or Q drive?"
$path= "${drive}:\${client}"
Get-ChildItem $path -Recurse -dir | Select-Object FullName | Export-CSV $home\downloads\"$client directories.csv"
Get-ChildItem $path -Recurse -Include *.pdf, *.jp*, *.xl*, *.doc* | Select-Object FullName | Export-CSV $home\downloads\"$client files.csv"
Write-Host "Check your downloads folder."
Pause
As I said, I am brand new to this. Is there a different command I could use, or a way to tell the script to skip directory names or files over a certain length?
Thanks!
You can check the value of the .Length property of each item's .FullName property, and if it's greater than 256 characters, use Out-Null:
Ex.
$items = Get-ChildItem -Path C:\users\myusername\desktop\myfolder
foreach ($item in $items)
{
    if ($item.FullName.Length -lt 256)
    {
        # do some stuff
    }
    else
    {
        Out-Null
    }
}
If you want to check the parent folder's path as well, you could check
$item.Parent.FullName.Length
in your processing as well.
I think you should close your strings on lines 5 and 6.
Instead of using ", you should use `" because currently your script parses the entire line 6 as one string.
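Alternatively (an untested sketch of those two lines), putting each output path into a single quoted string sidesteps the escaping question entirely:

Get-ChildItem $path -Recurse -dir | Select-Object FullName | Export-CSV "$home\downloads\$client directories.csv"
Get-ChildItem $path -Recurse -Include *.pdf, *.jp*, *.xl*, *.doc* | Select-Object FullName | Export-CSV "$home\downloads\$client files.csv"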
I have a source tree, say c:\s, with many sub-folders. One of the sub-folders is called "c:\s\Includes" which can contain one or more .cs files recursively.
I want to make sure that none of the .cs files in the c:\s\Includes... path exist in any other folder under c:\s, recursively.
I wrote the following PowerShell script which works, but I'm not sure if there's an easier way to do it. I've had less than 24 hours experience with PowerShell so I have a feeling there's a better way.
I can assume at least PowerShell 3 being used.
I will accept any answer that improves my script, but I'll wait a few days before accepting the answer. When I say "improve", I mean it makes it shorter, more elegant or with better performance.
Any help from anyone would be greatly appreciated.
The current code:
$excludeFolder = "Includes"
$h = @{}
foreach ($i in ls $pwd.path *.cs -r -file | ? DirectoryName -notlike ("*\" + $excludeFolder + "\*")) { $h[$i.Name] = $i.DirectoryName }
ls ($pwd.path + "\" + $excludeFolder) *.cs -r -file | ? { $h.Contains($_.Name) } | Select @{Name="Duplicate"; Expression={$h[$_.Name] + " has file with same name as " + $_.Fullname}}
1
I stared at this for a while, determined to write it without studying the existing answers, but I'd already glanced at the first sentence of Matt's answer mentioning Group-Object. After some different approaches, I get basically the same answer, except his is long-form and robust with regex character escaping and setup variables, mine is terse because you asked for shorter answers and because that's more fun.
$inc = '^c:\\s\\includes'
$cs = (gci -R 'c:\s' -File -I *.cs) | group name
$nopes = $cs |?{($_.Group.FullName -notmatch $inc)-and($_.Group.FullName -match $inc)}
$nopes | % {$_.Name; $_.Group.FullName}
Example output:
someFile.cs
c:\s\includes\wherever\someFile.cs
c:\s\lib\factories\alt\someFile.cs
c:\s\contrib\users\aa\testing\someFile.cs
The concept is:
Get all the .cs files in the whole source tree
Split them into groups of {filename: {files which share this filename}}
For each group, keep only those where the set of files contains any file with a path that matches the include folder and contains any file with a path that does not match the includes folder. This step covers
duplicates (if a file only exists once it cannot pass both tests)
duplicates across the {includes/not-includes} divide, instead of being duplicated within one branch
handles triplicates, n-tuplicates, as well.
Edit: I added the ^ to $inc to say it has to match at the start of the string, so the regex engine can fail faster for paths that don't match. Maybe this counts as premature optimization.
2
After that pretty dense attempt, the shape of a cleaner answer is much much easier:
Get all the files, split them into include, not-include arrays.
Nested for-loop testing every file against every other file.
Longer, but enormously quicker to write (it runs slower, though) and I imagine easier to read for someone who doesn't know what it does.
$sourceTree = 'c:\\s'
$allFiles = Get-ChildItem $sourceTree -Include '*.cs' -File -Recurse
$includeFiles = $allFiles | where FullName -imatch "$($sourceTree)\\includes"
$otherFiles = $allFiles | where FullName -inotmatch "$($sourceTree)\\includes"
foreach ($incFile in $includeFiles) {
    foreach ($oFile in $otherFiles) {
        if ($incFile.Name -ieq $oFile.Name) {
            write "$($incFile.Name) clash"
            write "* $($incFile.FullName)"
            write "* $($oFile.FullName)"
            write "`n"
        }
    }
}
3
Because code-golf is fun. If the hashtables are faster, what about this even less tested one-liner...
$h=@{};gci c:\s -R -file -Filt *.cs|%{$h[$_.Name]+=@($_.FullName)};$h.Values|?{$_.Count-gt1-and$_-like'c:\s\includes*'}
Edit: explanation of this version: It's doing much the same solution approach as version 1, but the grouping operation happens explicitly in the hashtable. The shape of the hashtable becomes:
$h = @{
    'fileA.cs': @('c:\cs\wherever\fileA.cs', 'c:\cs\includes\fileA.cs'),
    'file2.cs': @('c:\cs\somewhere\file2.cs'),
    'file3.cs': @('c:\cs\includes\file3.cs', 'c:\cs\x\file3.cs', 'c:\cs\z\file3.cs')
}
It hits the disk once for all the .cs files, iterates the whole list to build the hashtable. I don't think it can do less work than this for that bit.
It uses +=, so it can add files to the existing array for that filename, otherwise it would overwrite each of the hashtable lists and they would be one item long for only the most recently seen file.
It uses @() - because when it hits a filename for the first time, $h[$_.Name] won't return anything, and the script needs to put an array into the hashtable at first, not a string. If it was +=$_.FullName then the first file would go into the hashtable as a string and the += next time would do string concatenation and that's no use to me. This forces the first file in the hashtable to start an array by forcing every file to be a one item array. The least-code way to get this result is with +=@(..) but that churn of creating throwaway arrays for every single file is needless work. Maybe changing it to longer code which does less array creation would help?
Changing the section
%{$h[$_.Name]+=@($_.FullName)}
to something like
%{if (!$h.ContainsKey($_.Name)){$h[$_.Name]=@()};$h[$_.Name]+=$_.FullName}
(I'm guessing, I don't have much intuition for what's most likely to be slow PowerShell code, and haven't tested).
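For what it's worth, here's a sketch of that lower-churn idea using a generic List per filename instead of += on arrays (untested, same paths as above; List.Add() appends in place rather than re-allocating an array each time):

$h = @{}
gci c:\s -R -file -Filt *.cs | % {
    if (!$h.ContainsKey($_.Name)) {
        # Start a resizable list for this filename.
        $h[$_.Name] = New-Object System.Collections.Generic.List[string]
    }
    $h[$_.Name].Add($_.FullName)
}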
After that, using $h.Values isn't going over every file for a second time, it's going over every array in the hashtable - one per unique filename. That's got to happen to check the array size and prune the not-duplicates, but the -and operation short circuits - when the Count -gt 1 fails, the bit on the right checking the path name doesn't run.
If the array has two or more files in it, the -and $_ -like ... executes and pattern matches to see if at least one of the duplicates is in the includes path. (Bug: if all the duplicates are in c:\cs\includes and none anywhere else, it will still show them).
--
4
This is edited version 3 with the hashtable initialization tweak, and now it keeps track of seen files in $s, and then only considers those it's seen more than once.
$h=@{};$s=@{};gci 'c:\s' -R -file -Filt *.cs|%{if($h.ContainsKey($_.Name)){$s[$_.Name]=1}else{$h[$_.Name]=@()};$h[$_.Name]+=$_.FullName};$s.Keys|%{if ($h[$_]-like 'c:\s\includes*'){$h[$_]}}
Assuming it works, that's what it does, anyway.
--
Edit branch of topic; I keep thinking there ought to be a way to do this with the things in the System.Data namespace. Anyone know if you can connect System.Data.DataTable().ReadXML() to gci | ConvertTo-Xml without reams of boilerplate?
I'd do more or less the same, except I'd build the hashtable from the contents of the includes folder and then run over everything else to check for duplicates:
$root = 'C:\s'
$includes = "$root\includes"
$includeList = @{}
Get-ChildItem -Path $includes -Filter '*.cs' -Recurse -File |
    % { $includeList[$_.Name] = $_.DirectoryName }
Get-ChildItem -Path $root -Filter '*.cs' -Recurse -File |
    ? { $_.FullName -notlike "$includes\*" -and $includeList.Contains($_.Name) } |
    % { "Duplicate of '{0}': {1}" -f $includeList[$_.Name], $_.FullName }
I'm not as impressed with this as I would like but I thought that Group-Object might have a place in this question so I present the following:
$base = 'C:\s'
$unique = "$base\includes"
$extension = "*.cs"
Get-ChildItem -Path $base -Filter $extension -Recurse |
    Group-Object Name |
    Where-Object { ($_.Count -gt 1) -and (($_.Group).FullName -match [regex]::Escape($unique)) } |
    ForEach-Object {
        $filename = $_.Name
        ($_.Group).FullName -notmatch [regex]::Escape($unique) | ForEach-Object {
            "'{0}' has file with same name as '{1}'" -f (Split-Path $_), $filename
        }
    }
Collect all the files matching the extension filter $extension. Group the files based on their names. Then, of those groups, find every group where there is more than one of that particular file and at least one of the group members is in the directory $unique. Take those groups and print out all the files that are not from the unique directory.
From Comment
For what its worth this is what I used for testing to create a bunch of files. (I know the folder 9 is empty)
$base = "E:\Temp\dev\cs"
Remove-Item "$base\*" -Recurse -Force
0..9 | %{[void](New-Item -ItemType directory "$base\$_")}
1..1000 | % {
    $number = Get-Random -Minimum 1 -Maximum 100
    $folder = Get-Random -Minimum 0 -Maximum 9
    [void](New-Item -Path $base\$folder -ItemType File -Name "$number.txt" -Force)
}
After looking at all the others, I thought I would try a different approach.
$includes = "C:\s\includes"
$root = "C:\s"
# First script
Measure-Command {
    [string[]]$filter = ls $includes -Filter *.cs -Recurse | % name
    ls $root -Include $filter -Recurse -Filter *.cs |
        Where-Object { $_.FullName -notlike "$includes*" }
}
# Second Script
Measure-Command {
    $filter2 = ls $includes -Filter *.cs -Recurse
    ls $root -Recurse -Filter *.cs |
        Where-Object { $filter2.name -eq $_.name -and $_.FullName -notlike "$includes*" }
}
In my first script, I get all the include file names into a string array. Then I use that string array as the -Include parameter on Get-ChildItem. In the end, I filter the include folder out of the results.
In my second script, I enumerate everything and then filter after the pipe.
Remove the measure-command to see the results. I was using that to check the speed. With my dataset, the first one was 40% faster.
$FilesToFind = Get-ChildItem -Recurse 'c:\s\includes' -File -Include *.cs | Select-Object -ExpandProperty Name
Get-ChildItem -Recurse C:\S -File -Include *.cs | ? { $_.Name -in $FilesToFind -and $_.Directory -notmatch '^c:\\s\\includes' } | Select Name, Directory
Create a list of file names to look for.
Find all files that are in the list but not part of the directory the list was generated from
Print their name and directory