I am trying to build a script to move old PDFs into an archive folder from their source folder.
I have organized ~15,000 PDFs into a series of folders based on their numerical names. The next challenge is that there are multiple revisions of the same file, e.g.:
27850_rev0.pdf
27850_rev1.pdf
27850_rev2.pdf
What is the best approach to keeping the highest rev number in the source folder and moving all lower revisions to an archive?
Any help is appreciated.
Thanks,
You can use an expression with Group-Object to isolate all the files that start with that root filename, i.e. 27850*. If you then sort those files you know the last one is the highest revision number:
Get-ChildItem 'C:\temp\06-11-21' -Filter *.txt |
    Group-Object -Property { $_.Name.Split('_')[0] } |
    ForEach-Object {
        $_.Group | Sort-Object Name | Select-Object -SkipLast 1 |
            Copy-Item -Destination 'C:\temp\06-11-21_backup'
    }
I used a few text files in this example, but it should work just the same.
Note: Obviously you'll have to change the folders and filters...
Group-Object returns GroupInfo objects, so to get the group of original objects I reference $_.Group.
This does depend on the naming format being static. If you have underscores elsewhere in file names we'll likely have a problem. However, we can always adjust the expression.
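If you want to adapt this to the actual PDF layout in the question, here is a hedged variant (the folder paths are assumptions) that moves rather than copies, and sorts the revision number numerically so rev10 doesn't sort before rev2:
Get-ChildItem 'C:\PDFs\source' -Filter *_rev*.pdf |
    Group-Object -Property { $_.Name.Split('_')[0] } |
    ForEach-Object {
        # Sort by the numeric revision, keep the highest, move the rest to the archive.
        $_.Group |
            Sort-Object { [int]($_.BaseName -replace '^.*_rev(\d+)$', '$1') } |
            Select-Object -SkipLast 1 |
            Move-Item -Destination 'C:\PDFs\archive'
    }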
Is there a way to do something like this in PowerShell:
"If more than one file includes a certain set of text, delete all but one"
Example:
"...Cam1....jpg"
"...Cam2....jpg"
"...Cam2....jpg"
"...Cam3....jpg"
Then I would want one of the two "...Cam2....jpg" deleted, while the other one should stay.
I know that I can use something like
gci *Cam2* | del
but I don't know how I can make one of these files stay.
Also, for this to work, I need to look through all the files to see if there are any duplicates, which defeats the purpose of automating this process with a Powershell script.
I searched for a solution to this for a long time, but I just can't find something that is applicable to my scenario.
Get a list of files into a collection and use the range operator to select a subset of its elements. To remove all but the first element, start from index one. Like so:
$cams = gci "*cam2*"
if($cams.Count -gt 1) {
    # Indices run 0..Count-1, so skip index 0 and remove the rest.
    $cams[1..($cams.Count - 1)] | remove-item
}
Expanding on the idea of commenter boxdog:
# Find all duplicately named files.
$dupes = Get-ChildItem c:\test -file -recurse | Group-Object Name | Where-Object Count -gt 1
# Delete all duplicates except the 1st one per group.
$dupes | ForEach-Object { $_.Group | Select-Object -Skip 1 | Remove-Item -Force }
I've split this up into two sub-tasks to make it easier to understand. It is also a good idea to always separate directory iteration from file deletion, to avoid inconsistent results.
The first statement uses Group-Object to group files by name. It outputs a Count property containing the number of files per group. Then Where-Object is used to get only the groups that contain more than one file, which are the dupes. The result is stored in the variable $dupes, which is an array that looks like this:
Count Name      Group
----- ----      -----
    2 file1.txt {C:\test\subdir1\file1.txt, C:\test\subdir2\file1.txt}
    2 file2.txt {C:\test\subdir1\file2.txt, C:\test\subdir2\file2.txt}
The second statement uses ForEach-Object to iterate over all groups of duplicates. From the Group-Object call of the 1st statement we got a Group property that contains an array of file information objects. Using Select-Object -Skip 1 we select all but the 1st element of this array, and these are passed to Remove-Item to delete the files.
If I execute:
Get-ChildItem *.ext -recurse
the output consists of a series of Directory sections, each followed by one or more columns of info for the matching files. Is there something like the Unix find command, in which each matching file name appears on a single line with its full relative path?
Get-ChildItem by default outputs a table view defined in a format XML file somewhere, equivalent to:
get-childitem | format-table
whereas
get-childitem | format-list *
shows you the actual properties in the objects being output. See also How to list all properties of a PowerShell object. Then you can pick and choose the ones you want. This would give the full pathname:
get-childitem | select fullname
If you want the output to be just a string and not an object:
get-childitem | select -expand fullname
get-childitem | foreach fullname
Resolve-Path with the -Relative switch can be used to display the relative paths of a set of paths. You can collect the full path names (FullName property) from the Get-ChildItem command and use the member access operator . to grab the path values only.
Resolve-Path -Path (Get-ChildItem -Filter *.ext -Recurse).FullName -Relative
Note: The relative paths here only accurately reflect files found within the current directory (Get-ChildItem -Path .), i.e. Get-ChildItem -Path NotCurrentDirectory could have undesirable results.
Get-ChildItem's -Name switch does what you want:
It outputs the relative paths (possibly including subdir. components) of matching files as strings (type [string]).
# Lists file / dir. paths as *relative paths* (strings).
# (relative to the input dir, which is implicitly the current one here).
Get-ChildItem -Filter *.ext -Recurse -Name
Note that I've used -Filter, which significantly speeds up the traversal.
Caveat: As of PowerShell 7.0, -Name suffers from performance problems and behavioral quirks; see these GitHub issues:
https://github.com/PowerShell/PowerShell/issues/9014
https://github.com/PowerShell/PowerShell/issues/9119
https://github.com/PowerShell/PowerShell/issues/9126
https://github.com/PowerShell/PowerShell/issues/9122
https://github.com/PowerShell/PowerShell/issues/9120
I am having some problems passing the path plus filename to a parser. There are about 90 files of 1 GB each involved in my task. Each of the files is contained in a folder of its own. All of the folders are contained under a parent folder.
Goal: Ideally, I would like to parse 20 files simultaneously for multitasking and continue to the next 20 until all 90 files are done.
This means I would like to spawn concurrent parsing of 20 files in a batch at any given time. In carrying out the parsing, I would like to use Measure-Command to time the work from beginning to finish.
Script I have used:
Get-ChildItem -Path "E:\OoonaFTP\input\Videos3\" -Filter *.mp4 -Recurse |
ForEach-Object {
    # Invoke the parser with the call operator and time each run.
    Measure-Command { & "E:\OoonaFTP\Ooona_x64_ver_2.5.13\OoonaParser.exe" -encode -dat -drm $_.FullName } |
        Select-Object -Property TotalSeconds
}
===============================
I have a working batch script with a for statement, but it does each iteration one after another. That is not the ideal case, though. I would really like to accomplish this in PowerShell with simultaneous tasks.
Could someone please suggest some ways by which I could accomplish this?
Thank you very much!
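A minimal sketch of one way to batch this, assuming PowerShell 7 or later for ForEach-Object -Parallel; the paths and parser switches come from the script above, and the throttle of 20 matches the stated goal:
# Assumes PowerShell 7+ for ForEach-Object -Parallel; adjust paths as needed.
$parser = 'E:\OoonaFTP\Ooona_x64_ver_2.5.13\OoonaParser.exe'
Get-ChildItem -Path 'E:\OoonaFTP\input\Videos3' -Filter *.mp4 -Recurse |
    ForEach-Object -Parallel {
        # Time each parse; $using: carries the parser path into the parallel runspace.
        $t = Measure-Command {
            & $using:parser -encode -dat -drm $_.FullName
        }
        [pscustomobject]@{ File = $_.Name; TotalSeconds = $t.TotalSeconds }
    } -ThrottleLimit 20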
Thanks for the various suggestions. I'm curious that some of them lead to empty output in my PowerShell (PSVersion: 5.1.18362.145).
I tried a number of these and, inspired by some of them, found the best answer for my case at the moment:
Get-ChildItem *.ext -recurse | Select-Object -property fullname
(When I made the window wide enough I got all the info I needed; in general I suppose I might need to do more to get the formatting I want.)
I am trying to create a set of instructions in PowerShell on how to:
Retrieve files from several non related folders and count them, with exceptions of certain files and/or subfolders implemented
Give me the last modified file (most recent)
Remove duplicate files based on name, date-time, and file size, not just name (files with the same name but different content can exist in several folders), because a file could be repeated in the backup parameters through redundant wildcards/folders, which means the exact same file in the same path can be counted twice or more and ruin my file count.
What I have done so far (example of a browser profile path, after I enter it):
(The roles of the various parameters and exclusions are noted below the command.)
GCI bookmarks, 'Current Session', databases\*, extensions\manifest.json, 'Local Storage\*' -Recurse | ? { $_.FullName -inotmatch 'Local Storage\\http* | Databases\\http*'} | Get-Unique | measure-object -line
Parameters, in order: bookmarks (a file), 'Current Session' (a file), databases\* (a folder), extensions\manifest.json (many JSON files inside a folder), 'Local Storage\*' (a folder). Exclusions: HTTP* files inside the Local Storage folder and HTTP* subfolders inside the Databases folder.
This already filters all the files I want from the ones I don't want, counts them, and removes duplicates, BUT it also removes many JSON files with the same name inside different folders, without taking file size into account (though I think it still differentiates dates).
Bottom line, what I want is the capability of command-line RAR and 7-Zip to know exactly what to include in the archive: we give an input of files and folders, we may by mistake include a subfolder already covered by a previous wildcard, we program exceptions (-x! in the case of 7-Zip), and the program knows exactly which files to include and exclude, without compressing the same file twice.
This is so I can know whether a new backup is necessary relative to the previous one (different number of files, or a more recently modified file). I know about the "update" function in RAR and 7-Zip, but it's not what I want.
Speaking of the most recently written file, is there a way to do some sort of "parallel piping"? A recursive file search that outputs its results to two commands down the chain, instead of doing a (long) scan for the file count and then repeating the scan to find the most recent file?
What I mean is:
THIS:                                Instead of THIS:

        _______> FILE COUNT
       |
SCAN --+                             SCAN --> FILE COUNT ; SCAN --> MOST RECENT FILE
       |_______> MOST RECENT FILE
I've done almost all the work, but I hit a wall. All I'm missing is the removal of redundant files (e.g. the exact same file in the same path being counted twice or more due to redundant parameters, though I still want same-name files in different folders to be counted); and while at it I wouldn't mind getting the last modified file too, so I don't have to repeat the same scan again (PowerShell can be very slow sometimes).
This last point is less important, but it would be nice if it worked.
Any help you can give me on this would be greatly appreciated.
Thanks for reading :-)
Something along the lines of:
#generate an example list with the exact same files listed more than once, and possibly files by the same name in sub-folders
$lst = ls -file; $lst += ls -file -recurse
$UniqueFiles = ($lst | sort -Property FullName -Unique) #remove exact dupes
$UniqueFiles = ($UniqueFiles| sort -Property Directory,Name) #make it look like ls again
# or most recent file per filename
$MostRecent = ($lst | sort -Property LastWriteTime -Descending | group -Property Name | %{$_.group[0]})
Although I don't understand how the file size plays in, unless you're looking for files with the same size and name to be listed only once regardless of where they live in the folder tree. In that case you may want to group by hash value, so even if a file has a different name it'll still be listed only once:
$MostRecentSameSize = ($lst | sort -Property LastWriteTime -Descending | group -Property Name,Length | %{$_.group[0]})
# or by hash
$MostRecentByHash = ($lst | sort -Property LastWriteTime -Descending | group -Property {(Get-FileHash $_.FullName -Algorithm MD5).Hash} | %{$_.group[0]})
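For the "parallel piping" part of the question, a minimal sketch: do the scan once into a variable and reuse it for both outputs. The plain recursive listing here is a placeholder; the GCI arguments and exclusions from the question are assumed to go in its place.
# One scan, reused for both the count and the most recent file.
$scan   = ls -file -recurse | sort -Property FullName -Unique   # exact same path counted only once
$count  = ($scan | Measure-Object).Count
$newest = $scan | sort -Property LastWriteTime -Descending | select -First 1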
In a purely hypothetical situation of a person that downloaded some TV episodes, but is wondering if he/she accidentally downloaded an HDTV, a WEBRip and a WEB-DL version of an episode, how could PowerShell find these 'duplicates' so the lower quality versions can be automagically deleted?
First, I'd get all the files in the directory:
$Files = Get-ChildItem -Path $Directory -Exclude '*.nfo','*.srt','*.idx','*.sub' |
Sort-Object -Property Name
I exclude the non-video extensions for now, since they would cause false positives. I would still have to deal with them though (during the delete phase).
At this point, I would likely use a ForEach construct to parse through the files one by one and look for files that have the same episode number. If there are any, they should be looked at.
Assuming a common spaces equals dots notation here, a typical filename would be AwesomeSeries.S01E01.HDTV.x264-RLSGRP
To compare, I need to get only the episode number. In the above case, that means S01E01:
If ($File.BaseName -match 'S*(\d{1,2})(x|E)(\d{1,2})') { $EpisodeNumber = $Matches[0] }
In the case of S01E01E02 I would simply add a second if-statement, so I'm not concerned with that for now.
$EpisodeNumber should now contain S01E01. I can use that to discover if there are any other files with that episode number in $Files. I can do that with:
$Files -match $EpisodeNumber
This is where my trouble starts. The above will also return the file I'm currently processing. I could handle the duplicates immediately at this point, but then I would have to do the Get-ChildItem again, because otherwise the same match would be returned when the ForEach construct gets to the duplicate file, which would then result in an error.
I could store the files I wish to delete in an array and process them after the ForEach construct is over, but then I'd still have to filter out all the duplicates. After all, in the ForEach loop,
AwesomeSeries.S01E01.HDTV.x264-RLSGRP would first match AwesomeSeries.S01E01.WEB-DL.x264.x264-RLSGRP, only for AwesomeSeries.S01E01.WEB-DL.x264.x264-RLSGRP to match AwesomeSeries.S01E01.HDTV.x264-RLSGRP afterwards.
So maybe I should process every episode number only once, but how?
I get the feeling I'm being very inefficient here and there must be a better way to do this, so I'm asking for help. Can anyone point me in the right direction?
Filter the $Files array to exclude the current file when matching:
($Files | Where-Object {$_.FullName -ne $File.FullName}) -match $EpisodeNumber
Regarding the duplicates in the array at the end, you can use Select-Object -Unique to get only distinct entries.
Since you know how to get the episode number, let's use that to group the files together.
$Files = Get-ChildItem -Path $Directory -Exclude '*.nfo','*.srt','*.idx','*.sub' |
    Select-Object FullName, @{Name="EpisodeIndex";Expression={
        # We do not have to do it like this, but if your detection logic gets more complicated
        # then having this Select-Object block will be a cleaner option than using a one-liner calculated property
        If ($_.BaseName -match 'S*(\d{1,2})(x|E)(\d{1,2})'){$Matches[0]}
    }}
# Group the files by season episode index (that have one). Return groups that have more than one member as those would need attention.
$Files | Where-Object{$_.EpisodeIndex } | Group-Object -Property EpisodeIndex |
Where-Object{$_.Count -gt 1} | ForEach-Object{
# Expand the group members
$_.Group
# Not sure how you plan on dealing with it.
}
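If the plan is to keep only the best-quality release per episode, here is a hedged follow-on sketch: rank each group member by a quality tag found in its filename and remove everything after the best one. The ranking table and the -WhatIf dry run are assumptions, not part of the answer above.
# A sketch only: adjust the ranking to taste and drop -WhatIf once happy with the output.
$qualityRank = @{ 'WEB-DL' = 3; 'WEBRip' = 2; 'HDTV' = 1 }
$Files | Where-Object{ $_.EpisodeIndex } | Group-Object -Property EpisodeIndex |
    Where-Object{ $_.Count -gt 1 } | ForEach-Object{
        $_.Group |
            Sort-Object {
                # Rank each file by the quality tag in its name (0 if none found).
                $rank = 0
                foreach ($tag in $qualityRank.Keys) {
                    if ($_.FullName -match [regex]::Escape($tag)) { $rank = $qualityRank[$tag] }
                }
                $rank
            } -Descending |
            Select-Object -Skip 1 |
            ForEach-Object{ Remove-Item -LiteralPath $_.FullName -WhatIf }
    }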
I have made a very simple ps1 script that renames my txt files with an "ID" number, like
Card.321.txt
This works for simple renames, but I need a mass rename of files, so I needed something different:
$i = 1
Get-ChildItem *.txt | %{Rename-Item $_ -NewName ('Card.{0:D3}.txt' -f $i++)}
But I can only properly run it if there are no other files already named Card.xxx.txt, and every day I get one new file that I store in an archive folder after it gets renamed.
How can I make a script that doesn't redo the whole mass renaming task?
I need a counter that can continue from yesterday's run of the same script.
Card.321.txt
Card.322.txt
Card.323.txt
Card.324.txt
Card.325.txt
ToDaysFiledToBeRenamed.txt
How about something like this? It's not the most efficient, but it should be easy to read:
$thePath = "C:\temp"
$allTXTFiles = Get-ChildItem $thePath *.txt
$filestoberenamed = $allTXTFiles | Where-Object{$_.BaseName -notmatch "^Card\.\d{3}$"}
# Cast to [int] so the D3 format specifier always works (Measure-Object may return a double).
$highestNumber = [int]($allTXTFiles | Where-Object{$_.BaseName -match "^Card\.\d{3}$"} |
    ForEach-Object{[int]($_.BaseName -split "\.")[-1]} |
    Measure-Object -Maximum | Select-Object -ExpandProperty Maximum)
# Pre-increment so the first new file gets the next free number instead of colliding with the current highest.
$filestoberenamed | ForEach-Object{Rename-Item $_ -NewName ('Card.{0:D3}.txt' -f ++$highestNumber)}
Collects all the files and splits them into two groups. Using $allTXTFiles we filter the files that follow the "Card" naming convention and parse out the number. Of those numbers we determine the current highest one as $highestNumber.
Then we take the remaining files as $filestoberenamed and put them through your Rename-Item snippet, using $highestNumber as the index.
Known Caveats
This would not act correctly if there is ever a file with a number higher than 999. It would allow the creation of one, but currently I am only looking for files with 3 digits. We could change that to ^Card\.\d+$ instead; it depends on what logic you want.
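A minimal sketch of that variant, reusing the same variable names from above; the '{0:D3}' format simply produces more digits once the counter passes 999:
# Only the two regex checks change; everything else stays the same.
$filestoberenamed = $allTXTFiles | Where-Object{$_.BaseName -notmatch "^Card\.\d+$"}
$highestNumber = [int]($allTXTFiles | Where-Object{$_.BaseName -match "^Card\.\d+$"} |
    ForEach-Object{[int]($_.BaseName -split "\.")[-1]} |
    Measure-Object -Maximum | Select-Object -ExpandProperty Maximum)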