Compare contents of 6 files and delete what is not matching - PowerShell

I have 6 files which are created dynamically (so I don't know their contents). I need to compare these 6 files (more precisely, compare one file with the other 5) and see which entries in file 1 match entries in the other 5. The matching entries should be kept; the others need to be deleted.
I coded something like below, but it is deleting everything (the matching entries too).
$lines = Get-Content "C:\snaps.txt"
$check1 = Get-Content "C:\Previous_day_latest.txt"
$check2 = Get-Content "C:\this_week_saved_snaps.txt"
$check3 = Get-Content "C:\all_week_latest_snapshots.txt"
$check4 = Get-Content "C:\each_month_latest.txt"
$check5 = Get-Content "C:\exclusions.txt"
foreach($l in $lines)
{
if(($l -notmatch $check1) -and ($l -notmatch $check2) -and ($l -notmatch $check3) -and ($l -notmatch $check4))
{
Remove-Item -Path "C:\$l.txt"
}else
{
#nothing
}
}
foreach($ch in $check5)
{
Remove-Item -Path "C:\$ch.txt"
}
The contents of the 6 files are as shown below:
$lines
testinstance-01-07-15-08-00
testinstance-10-07-15-23-00
testinstance-13-02-15-13-00
testinstance-15-06-15-23-00
testinstance-19-01-15-23-00
testinstance-23-05-15-20-00
testinstance-27-03-15-23-00
testinstance-28-02-15-23-00
testinstance-29-07-15-08-00
testinstance-30-04-15-23-00
testinstance-30-06-15-23-00
testinstance-31-01-15-23-00
testinstance-31-12-14-23-00
$check1
testinstance-29-07-15-08-00
$check2
testinstance-23-05-15-20-00
testinstance-27-03-15-23-00
$check3
testinstance-01-07-15-23-00
testinstance-13-02-15-13-00
testinstance-19-01-15-23-00
$check4
testinstance-28-02-15-23-00
testinstance-30-04-15-23-00
testinstance-30-06-15-23-00
testinstance-31-01-15-23-00
$check5
testinstance-31-12-14-23-00
I've read about Compare-Object, but I'm not sure how it can be applied in my case, as the contents of all 5 files will be different and all of those entries should be saved from deletion. Can someone please guide me on how to achieve this? Any help would be really appreciated.

I would create an array of the files to check, so you can simply add new files without modifying other parts of your script.
I use the Where-Object cmdlet to keep only the lines that also exist in the reference file (using the -in operator) and finally overwrite the file:
$referenceFile = 'C:\snaps.txt'
$compareFiles = @(
'C:\Previous_day_latest.txt',
'C:\this_week_saved_snaps.txt',
'C:\all_week_latest_snapshots.txt',
'C:\each_month_latest.txt',
'C:\exclusions.txt'
)
# get the content of the reference file
$referenceContent = (gc $referenceFile)
foreach ($file in $compareFiles)
{
# get the content of the file to check
$content = (gc $file)
# filter all contents from the file to check which are in the reference file and save it
$content | where { $_ -in $referenceContent } | sc $file
}
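Note that the -in operator requires PowerShell 3.0 or later; on older versions, an equivalent (a sketch using the same variables as above) would be to flip the comparison around with -contains:
$content | where { $referenceContent -contains $_ } | sc $file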

You can use the -contains operator to compare array contents. If you open all the files you want to check and store their contents into an array, you can compare that with the reference file:
$lines = Get-Content "C:\snaps.txt"
$check1 = "C:\Previous_day_latest.txt"
$check2 = "C:\this_week_saved_snaps.txt"
$check3 = "C:\all_week_latest_snapshots.txt"
$check4 = "C:\each_month_latest.txt"
$check5 = "C:\exclusions.txt"
$checklines = @()
(1..5) | ForEach-Object {
$comp = Get-Content $(Get-Variable check$_).value
$checklines += $comp
}
$matches = $lines | ? { $checklines -contains $_ }
If you switch the -contains to -notcontains, you'll see the three lines that don't match.
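From there, one possible way to finish the deletion step (a sketch that assumes, like your original script, that each snapshot name corresponds to a file named C:\<name>.txt) would be:
$notMatched = $lines | ? { $checklines -notcontains $_ }
$notMatched | ForEach-Object { Remove-Item -Path "C:\$_.txt" -WhatIf }  # drop -WhatIf once the output looks right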

The other answers here are great, but I wanted to show that Compare-Object could still work. You need to use it in a loop, however. Just to show something else, I included a simple use of Join-Path for building the array of checks; that way, when you move your files to a production area, you only have to update one path instead of several.
$rootPath = "C:\"
$fileNames = "Previous_day_latest.txt", "this_week_saved_snaps.txt", "all_week_latest_snapshots.txt", "each_month_latest.txt", "exclusions.txt"
$lines = Get-Content (Join-path $rootPath "snaps.txt")
$checks = $fileNames | ForEach-Object{Join-Path $rootPath $_}
ForEach($check in $checks){
Compare-Object -ReferenceObject $lines -DifferenceObject (Get-Content $check) -IncludeEqual |
Where-Object{$_.SideIndicator -eq "=="} |
Select-Object -ExpandProperty InputObject |
Set-Content $check
}
So we take each file path and use Compare-Object in a loop comparing each to the $lines array. Using -IncludeEqual we find the lines that both files share and write those back to the file.
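To make the SideIndicator filter more concrete, comparing $lines against the first check file from the question would produce output along these lines (== means the line exists in both files, <= means it only exists in the reference file):
Compare-Object -ReferenceObject $lines -DifferenceObject (Get-Content "C:\Previous_day_latest.txt") -IncludeEqual

InputObject                  SideIndicator
-----------                  -------------
testinstance-29-07-15-08-00  ==
testinstance-01-07-15-08-00  <=
testinstance-10-07-15-23-00  <=
...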
Depending on how many check files you have and where they are, it might be easier to use this line to build the $checks array:
$checks = Get-ChildItem "C:\" -Filter "*.txt" | Select-Object -Expand FullName

Related

Memory exception while filtering large CSV files

I'm getting a memory exception while running this code. Is there a way to filter one file at a time, write the output, and append after processing each file? It seems the code below loads everything into memory.
$inputFolder = "C:\Change\2019\October"
$outputFile = "C:\Change\2019\output.csv"
Get-ChildItem $inputFolder -File -Filter '*.csv' |
ForEach-Object { Import-Csv $_.FullName } |
Where-Object { $_.machine_type -eq 'workstations' } |
Export-Csv $outputFile -NoType
Maybe you can filter and export your files one by one, appending the result to your output file like this:
$inputFolder = "C:\Change\2019\October"
$outputFile = "C:\Change\2019\output.csv"
Remove-Item $outputFile -Force -ErrorAction SilentlyContinue
Get-ChildItem $inputFolder -Filter "*.csv" -file | %{import-csv $_.FullName | where machine_type -eq 'workstations' | export-csv $outputFile -Append -notype }
Note: The reason for not using Get-ChildItem ... | Import-Csv ... - i.e., for not directly piping Get-ChildItem to Import-Csv and instead having to call Import-Csv from the script block ({ ... }) of an auxiliary ForEach-Object call - is a bug in Windows PowerShell that has since been fixed in PowerShell Core; see the bottom section for a more concise workaround.
However, even output from ForEach-Object script blocks should stream to the remaining pipeline commands, so you shouldn't run out of memory - after all, a salient feature of the PowerShell pipeline is object-by-object processing, which keeps memory use constant, irrespective of the size of the (streaming) input collection.
You've since confirmed that avoiding the aux. ForEach-Object call does not solve the problem, so we still don't know what causes your out-of-memory exception.
Update:
This GitHub issue contains clues as to the reason for excessive memory use, especially with many properties that contain small amounts of data.
This GitHub feature request proposes using strongly typed output objects to help the issue.
The following workaround, which uses the switch statement to process the files as text files, may help:
$header = ''
Get-ChildItem $inputFolder -Filter *.csv | ForEach-Object {
$i = 0
switch -Wildcard -File $_.FullName {
'*workstations*' {
# NOTE: If no other columns contain the word `workstations`, you can
# simplify and speed up the command by omitting the `ConvertFrom-Csv` call
# (you can make the wildcard matching more robust with something
# like '*,workstations,*')
if ((ConvertFrom-Csv "$header`n$_").machine_type -ne 'workstations') { continue }
$_ # row whose 'machine_type' column value equals 'workstations'
}
default {
if ($i++ -eq 0) {
if ($header) { continue } # header already written
else { $header = $_; $_ } # header row of 1st file
}
}
}
} | Set-Content $outputFile
Here's a workaround for the bug of not being able to pipe Get-ChildItem output directly to Import-Csv, by passing it as an argument instead:
Import-Csv -LiteralPath (Get-ChildItem $inputFolder -File -Filter *.csv) |
Where-Object { $_.machine_type -eq 'workstations' } |
Export-Csv $outputFile -NoType
Note that in PowerShell Core you could more naturally write:
Get-ChildItem $inputFolder -File -Filter *.csv | Import-Csv |
Where-Object { $_.machine_type -eq 'workstations' } |
Export-Csv $outputFile -NoType
Solution 2:
$inputFolder = "C:\Change\2019\October"
$outputFile = "C:\Change\2019\output.csv"
$encoding = [System.Text.Encoding]::UTF8 # modify encoding if necessary
$Delimiter=','
#find the header for your files => take the first row of the first file that contains data
$Header = Get-ChildItem -Path $inputFolder -Filter *.csv | Where length -gt 0 | select -First 1 | Get-Content -TotalCount 1
#if no header was found, there is no file with size > 0 => we quit
if(! $Header) {return}
#create an array from the header
$HeaderArray=$Header -split $Delimiter -replace '"', ''
#open the output file
$w = New-Object System.IO.StreamWriter($outputfile, $true, $encoding)
#write the header that was found
$w.WriteLine($Header)
#loop over the csv files
Get-ChildItem $inputFolder -File -Filter "*.csv" | %{
#open the file for reading
$r = New-Object System.IO.StreamReader($_.fullname, $encoding)
$skiprow = $true
while ($line = $r.ReadLine())
{
#skip the header row
if ($skiprow)
{
$skiprow = $false
continue
}
#build an object for the current row, using the header found above
$Object=$line | ConvertFrom-Csv -Header $HeaderArray -Delimiter $Delimiter
#write the row to the output file if it matches your condition
if ($Object.machine_type -eq 'workstations') { $w.WriteLine($line) }
}
$r.Close()
$r.Dispose()
}
$w.close()
$w.Dispose()
You have to read and write the .csv files one row at a time, using StreamReader and StreamWriter:
$filepath = "C:\Change\2019\October"
$outputfile = "C:\Change\2019\output.csv"
$encoding = [System.Text.Encoding]::UTF8
$files = Get-ChildItem -Path $filepath -Filter *.csv
$w = New-Object System.IO.StreamWriter($outputfile, $true, $encoding)
$headerWritten = $false
foreach ($file in $files)
{
$r = New-Object System.IO.StreamReader($file.FullName, $encoding)
# the first line of every file is the header; write it only once
$header = $r.ReadLine()
if (!$headerWritten -and $null -ne $header)
{
$w.WriteLine($header)
$headerWritten = $true
}
while (($line = $r.ReadLine()) -ne $null)
{
# keep only rows that contain 'workstations'
# (assumes the word only occurs in the machine_type column)
if ($line -like '*workstations*') { $w.WriteLine($line) }
}
$r.Close()
$r.Dispose()
}
$w.Close()
$w.Dispose()
get-content *.csv | add-content combined.csv
Make sure combined.csv doesn't exist when you run this, or it's going to go full Ouroboros.
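If you cannot guarantee that, a slightly safer variant (still plain concatenation, no filtering on machine_type) is to exclude the output file from the input set:
Get-ChildItem *.csv -Exclude combined.csv | Get-Content | Add-Content combined.csv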

batch parsing data after a character/text from a word file in powershell

I am processing a batch of word documents.
I've successfully been able to extract the data that I wanted to a text file using this code:
$info = (gci 'C:\Users\xxx\xxx' -Recurse -File *.doc -Include *lol -Exclude *poo*) | ForEach-Object {
Get-Content ($_.fullName ) | Where-Object { $_.Contains("Date:")}
Get-Content ($_.fullName ) | Where-Object { $_.Contains("Name:")}
}
$info > C:\Users\xxx.txt
This creates a text file like this:
Date: 1/11/2011
Name Joe Shmoe
For each found document...
I would like to remove the "Date:" and "Name:" part of the output for later extraction to an Excel file.
I've tried multiple methods, using $name.Split(':') followed by $name2 = $name.Substring($name.IndexOf(':') + 1) and returning $name2. Heck, I've tried a ton of things. The best I could get was a complete iteration through each of the 100 files (with different names/dates), but only one name and date was returned 100 times. Could someone please help me out with this? Thank you!
This should accomplish your goal:
#Requires -Version 3
$Params = @{
Path = 'C:\Users\xx\xx'
Filter = '*.doc'
Include = '*lol'
Exclude = '*poo*'
File = $True
Recurse = $True
}
Get-ChildItem @Params |
ForEach-Object {
(Get-Content -Path $_.FullName |
Where-Object { $_ -match '(date|name):' }) -replace '(date|name):'
} |
Out-File -FilePath 'C:\Users\xxx.txt'
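As a quick illustration of what the -replace leaves behind (note the leading space, which you may want to trim before pulling the values into Excel):
PS> 'Date: 1/11/2011' -replace '(date|name):'
 1/11/2011
PS> ('Name: Joe Shmoe' -replace '(date|name):').Trim()
Joe Shmoe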

Using an array and Get-ChildItem to find filenames with specific IDs

In the most basic sense, I have a SQL query which returns an array of IDs, which I've stored into a variable $ID. I then want to perform a Get-ChildItem on a specific folder for any filenames that contain any of the IDs in said variable ($ID). There are three possible filenames that could exist:
$ID.xml
$ID_input.xml
$ID_output.xml
Once I have the results of get-childitem, I want to output this as a text file and delete the files from the folder. The part I'm having trouble with is filtering the results of get-childitem to define the filenames I'm looking for, so that only files that contain the IDs from the SQL output are displayed in my get-childitem results.
I found another way of doing this, which works fine, by using foreach ($i in $id), then building the desired filenames from that and performing a Remove-Item on them:
# Build list of XML files
$XMLFile = foreach ($I in $ID)
{
"$XMLPath\$I.xml","$XMLPath\$I`_output.xml","$XMLPath\$I`_input.xml"
}
# Delete XML files
$XMLFile | Remove-Item -Force
However, this produces a lot of errors in the shell, as it tries to delete files that don't exist, but whose IDs do exist in the database. I also can't figure out how to produce a text output of the files that were actually deleted, doing it this way, so I'd like to get back to the get-childitem approach, if possible.
Any ideas would be greatly appreciated. If you require more info, just ask.
You can find all *.xml files with Get-ChildItem to minimize the number of files to test and then use regex to match the filenames. It's faster than a loop/multiple test, but harder to read if you're not familiar with regex.
$id = 123,111
#Create regex-pattern (search-pattern)
$regex = "^($(($id | ForEach-Object { [regex]::Escape($_) }) -join '|'))(?:_input|_output)?$"
$filesToDelete = Get-ChildItem -Path "c:\users\frode\Desktop\test" -Filter "*.xml" | Where-Object { $_.BaseName -match $regex }
#Save list of files
$filesToDelete | Select-Object -ExpandProperty FullName | Out-File "deletedfiles.txt" -Append
#Remove files (remove -WhatIf when ready)
$filesToDelete | Remove-Item -Force -WhatIf
Regex demo: https://regex101.com/r/dS2dJ5/2
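For the two sample IDs above, the generated pattern looks like this, which matches base names such as 123, 123_input and 111_output:
PS> $regex
^(123|111)(?:_input|_output)?$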
Try this:
clear
$ID = "a", "b", "c"
$filesToDelete = New-Object System.Collections.ArrayList
$files = Get-ChildItem e:\
foreach ($I in $ID)
{
($files | Where-Object { $_.Name -eq "$I.xml" }).FullName | ForEach-Object { [void]$filesToDelete.Add($_) }
($files | Where-Object { $_.Name -eq "$($I)_input.xml" }).FullName | ForEach-Object { [void]$filesToDelete.Add($_) }
($files | Where-Object { $_.Name -eq "$($I)_output.xml" }).FullName | ForEach-Object { [void]$filesToDelete.Add($_) }
}
$filesToDelete | select-object -Unique | ForEach-Object { Remove-Item $_ -Force }

Using PowerShell to replace multiple strings in multiple files & folders

I have a list of strings in a CSV file. The format is:
OldValue,NewValue
223134,875621
321321,876330
....
and the file contains a few hundred rows (each OldValue is unique). I need to process changes over a number of text files in a number of folders & subfolders. My best guess at the number of folders, files, and lines of text is: 15 folders, around 150 text files in each folder, with approximately 65,000 lines of text per folder (between 400 and 500 lines per text file).
I will make 2 passes at the data, unless I can do it in one. First pass is to generate a text file I will use as a check list to review my changes. Second pass is to actually make the change in the file. Also, I only want to change the text files where the string occurs (not every file).
I'm using the following Powershell script to go through the files & produce a list of the changes needed. The script runs, but is beyond slow. I haven't worked on the replace logic yet, but I assume it will be similar to what I've got.
# replace a string in a file with powershell
[reflection.assembly]::loadwithpartialname("Microsoft.VisualBasic") | Out-Null
Function Search {
# Parameters $Path and $SearchString
param ([Parameter(Mandatory=$true, ValueFromPipeline = $true)][string]$Path,
[Parameter(Mandatory=$true)][string]$SearchString
)
try {
#.NET FindInFiles Method to Look for file
[Microsoft.VisualBasic.FileIO.FileSystem]::GetFiles(
$Path,
[Microsoft.VisualBasic.FileIO.SearchOption]::SearchAllSubDirectories,
$SearchString
)
} catch { $_ }
}
if (Test-Path "C:\Work\ListofAllFilenamesToSearch.txt") { # if file exists
Remove-Item "C:\Work\ListofAllFilenamesToSearch.txt"
}
if (Test-Path "C:\Work\FilesThatNeedToBeChanged.txt") { # if file exists
Remove-Item "C:\Work\FilesThatNeedToBeChanged.txt"
}
$filefolder1 = "C:\TestFolder\WorkFiles"
$ftype = "*.txt"
$filenames1 = Search $filefolder1 $ftype
$filenames1 | Out-File "C:\Work\ListofAllFilenamesToSearch.txt" -Width 2000
if (Test-Path "C:\Work\FilesThatNeedToBeChanged.txt") { # if file exists
Remove-Item "C:\Work\FilesThatNeedToBeChanged.txt"
}
(Get-Content "C:\Work\NumberXrefList.CSV" |where {$_.readcount -gt 1}) | foreach{
$OldFieldValue, $NewFieldValue = $_.Split("|")
$filenamelist = (Get-Content "C:\Work\ListofAllFilenamesToSearch.txt" -ReadCount 5) #|
foreach ($j in $filenamelist) {
#$testvar = (Get-Content $j )
#$testvar = (Get-Content $j -ReadCount 100)
$testvar = (Get-Content $j -Delimiter "\n")
Foreach ($i in $testvar)
{
if ($i -imatch $OldFieldValue) {
$j + "|" + $OldFieldValue + "|" + $NewFieldValue | Out-File "C:\Work\FilesThatNeedToBeChanged.txt" -Width 2000 -Append
}
}
}
}
$FileFolder = (Get-Content "C:\Work\FilesThatNeedToBeChanged.txt" -ReadCount 5)
Get-ChildItem $FileFolder -Recurse |
select -ExpandProperty fullname |
foreach {
if (Select-String -Path $_ -SimpleMatch $OldFieldValue -Debug -Quiet) {
(Get-Content $_) |
ForEach-Object {$_ -replace $OldFieldValue, $NewFieldValue }|
Set-Content $_ -WhatIf
}
}
In the code above, I've tried several things with Get-Content - default, with -ReadCount, and -Delimiter - in an attempt to avoid an out of memory error.
The only thing I have control over is the length of the old & new replacement strings file. Is there a way to do this in Powershell? Is there a better option/solution? I'm running Windows 7, Powershell version 3.0.
Your main problem is that you're reading each file over and over again, once for every replacement term. You need to invert the loops: iterate over the files on the outside and over the replacement terms on the inside. Also, pre-load the CSV. Something like:
$filefolder1 = "C:\TestFolder\WorkFiles"
$ftype = "*.txt"
$filenames = gci -Path $filefolder1 -Filter $ftype -Recurse
$replaceValues = Import-Csv -Path "C:\Work\NumberXrefList.CSV"
foreach ($file in $filenames) {
$contents = Get-Content -Path $file.FullName
foreach ($replaceValue in $replaceValues) {
$contents = $contents -replace $replaceValue.OldValue, $replaceValue.NewValue
}
Copy-Item $file.FullName "$($file.FullName).old"
Set-Content -Path $file.FullName -Value $contents
}
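One thing to keep in mind: -replace treats the search string as a regular expression. With purely numeric IDs like yours that makes no difference, but if the old values could ever contain regex metacharacters, it is safer to escape them first, e.g.:
$contents = $contents -replace [regex]::Escape($replaceValue.OldValue), $replaceValue.NewValue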

Need to add the full path of where test was referenced from

So far I have a hash table with 2 values in it. Right now the code below exports all the unique lines and gives me a count of how many times each line was referenced across hundreds of XML files. This is one part.
I now need to find out which subfolder contains the XML file that has the unique line referenced in the hash table. Is this possible?
$ht = @{}
Get-ChildItem -recurse -Filter *.xml | Get-Content | %{$ht[$_] = $ht[$_]+1}
$ht
# To export to CSV:
$ht.GetEnumerator() | select key, value | Export-Csv D:\output.csv
To get the file path into your output, you need to assign it to a variable in the first pipeline block.
Is this something similar to what you need?
$ht = @{}
Get-ChildItem -recurse -Filter *.xml | %{$path = $_.FullName; Get-Content $path} | % { $ht[$_] = $ht[$_] + $path + ";"}
The code above will return a hash table in "config line" = "list of paths" format, with the paths separated by semicolons.
EDIT:
If you need to return three elements (unique line, count, and an array of the paths where it was found), it gets more complicated. Here is code that will return an array of PSObjects; each contains info for one unique line in the XML files.
$ht = @()
$files = Get-ChildItem -recurse -Filter *.xml
foreach ($file in $files) {
$path = $file.FullName
$lines = Get-Content $path
foreach ($line in $lines) {
if ($match = $ht | where {$_.line -EQ $line}) {
$match.count = $match.count + 1
$match.Paths += $path
} else {
$ht += New-Object PSObject -Property @{
Count = 1
Paths = @(,$path)
Line = $line }
}
}
}
$ht
I'm sure it can be shortened and optimized, but hopefully it is enough to get you started.
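For example, a shorter variant (just a sketch, assuming the files are small enough that all lines can be collected in memory first) could lean on Group-Object instead of building the array by hand:
$ht = Get-ChildItem -Recurse -Filter *.xml | ForEach-Object {
    $path = $_.FullName
    # emit one object per line, tagged with the file it came from
    Get-Content $path | ForEach-Object { [pscustomobject]@{ Line = $_; Path = $path } }
} | Group-Object -Property Line | ForEach-Object {
    [pscustomobject]@{
        Line  = $_.Name
        Count = $_.Count
        Paths = @($_.Group.Path | Select-Object -Unique)
    }
}
$ht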