PowerShell clearing memory after finishing - powershell

I have a PowerShell script that reads a large CSV file (4GB+), finds certain lines, then writes the lines to other files.
I'm noticing that when it reaches the last line of the script, echo "Processed $datacounter total lines in the $datafile file", it doesn't actually finish until 5-10 minutes later.
What is it doing for that period? When it does finish, memory usage drops off significantly. Is there a way to force it to clear memory at the end of the script?
Here is the final version of my script for reference.
# Get the filename
$datafile = Read-Host "Filename"
$dayofweek = Read-Host "Day of week (IE 1 = Monday, 2 = Tuesday..)"
$campaignWriters = @{}
# Create campaign ID hash table
$campaignByID = @{}
foreach($c in (Import-Csv 'campaigns.txt' -Delimiter '|')) {
foreach($id in ($c.CampaignID -split ' ')) {
$campaignByID[$id] = $c.CampaignName
}
foreach($cname in ($c.CampaignName)) {
$writer = $campaignWriters[$cname] = New-Object IO.StreamWriter($dayofweek + $cname + '_filtered.txt')
if($dayofweek -eq 1) {
$writer.WriteLine("ID1|ID2|ID3|ID4|ID5|ID6|Time|Time-UTC-Sec")
}
}
}
# Display the campaigns
$campaignByID.GetEnumerator() | Sort-Object Value
# Read in data file
$encoding = [Text.Encoding]::GetEncoding('iso-8859-1')
$datareader = New-Object IO.StreamReader($datafile, $encoding)
$datacounter = 0
echo "Starting.."
get-date -Format g
while (!$datareader.EndOfStream) {
$data = $datareader.ReadLine().Split('þ')
# Find the Campaign in the hashtable
$campaignName = $campaignByID[$data[3]]
if($campaignName) {
$writer = $campaignWriters[$campaignName]
# If a campaign name was returned from the hash, add the line using that campaign's writer
$writer.WriteLine(($data[20,3,5,8,12,14,0,19] -join '|'))
}
$datacounter++;
}
$datareader.Close()
foreach ($writer in $campaignWriters.Values) {
$writer.Close()
}
echo "Done!"
get-date -Format g
echo "Processed $datacounter total lines in the $datafile file"

I'm assuming that campaigns.txt is the multi-gigabyte file you are referring to. If it's one of the other files, this might not make as much sense.
If so, invoking Import-Csv inside the parentheses and then iterating over the result with the foreach statement is what drives your memory usage so high, because the whole file is loaded into memory before the loop starts. A better alternative is to use a PowerShell pipeline to stream records from the file without keeping all of them in memory at the same time. You achieve this by changing the foreach statement into a ForEach-Object cmdlet:
Import-Csv 'campaigns.txt' -Delimiter '|' | ForEach-Object {
foreach($id in ($_.CampaignID -split ' ')) {
$campaignByID[$id] = $_.CampaignName
}
}
The .NET garbage collector is optimized for cases where the majority of objects are short-lived. Therefore this change should result in a noticeable performance increase, as well as a reduced wind-down time at the end.
I advise against forcing garbage collection with [System.GC]::Collect(); the garbage collector knows best when it should run. The reasons for this are complex. If you really want to know the details, Maoni's blog has a wealth of information about garbage collection in the .NET environment.

It may or may not work, but you can try to tell garbage collection to run:
[System.GC]::Collect()
You don't have fine grained control over it though, and it may help to Remove-Variable or set variables to $null for some things before running it so that there aren't references to the data anymore.
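As a rough illustration of that suggestion, the sketch below drops the last reference to a large object and then requests a collection. The $big buffer is a hypothetical stand-in for whatever data the script still holds; the sketch does not show whether this would shorten the wind-down in the original script.

```powershell
# Hypothetical stand-in for data the script no longer needs (about 50 MB).
$big = [byte[]]::new(50MB)
$heldBytes = [System.GC]::GetTotalMemory($false)

# Drop the only reference so the buffer becomes unreachable.
Remove-Variable -Name big

# Request a full, blocking collection. The runtime still decides how much
# memory is actually handed back to the operating system.
[System.GC]::Collect()
[System.GC]::WaitForPendingFinalizers()
$afterBytes = [System.GC]::GetTotalMemory($true)

'Managed heap before: {0:N0} bytes, after: {1:N0} bytes' -f $heldBytes, $afterBytes
```

The numbers printed will vary from run to run, so treat them as an indication only.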

Related

Check if a condition is met by a line within a TXT but "in an advanced way"

I have a TXT file with 1300 megabytes (huge thing). I want to build code that does two things:
Every line contains a unique ID at the beginning. For all lines with the same unique ID, I want to check whether the conditions are met for that "group" of IDs. (This tells me for how many lines with the unique ID X all conditions have been met.)
When the script has finished, I want to remove all lines from the TXT where the condition was met (see 2), so I can rerun the script with another condition set to "narrow down" the whole document.
After a few cycles I finally have a set of conditions that applies to all lines in the document.
My current approach seems to be very slow (one cycle takes hours). My final result is a set of conditions that apply to all lines in the file.
If you find an easier way to do that, feel free to recommend.
Help is welcome :)
Code so far (does not fulfill everything from 1&2)
foreach ($item in $liste)
{
# Check Conditions
if ( ($item -like "*XXX*") -and ($item -like "*YYY*") -and ($item -notlike "*ZZZ*")) {
# Add a line to a document to see which lines match condition
Add-Content "C:\Desktop\it_seems_to_match.txt" "$item"
# Retrieve the unique ID from the line and feed array.
$array += $item.Split("/")[1]
# Remove the line from final document
$liste = $liste -replace $item, ""
}
}
# Pipe the "new cleaned" list somewhere
$liste | Set-Content -Path "C:\NewListToWorkWith.txt"
# Show me the counts
$array | group | % { $h = @{} } { $h[$_.Name] = $_.Count } { $h } | Out-File "C:\Desktop\count.txt"
Demo Lines:
images/STRINGA/2XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGA/3XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/4XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGC/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
performance considerations:
Add-Content "C:\Desktop\it_seems_to_match.txt" "$item"
try to avoid wrapping cmdlet pipelines
See also: Mastering the (steppable) pipeline
$array += $item.Split("/")[1]
Try to avoid using the increase assignment operator (+=) to create a collection
See also: Why should I avoid using the increase assignment operator (+=) to create a collection
$liste = $liste -replace $item, ""
This is a very expensive operation considering that you are reassigning (copying) a long list ($liste) with each iteration.
Besides it is a bad practice to change an array that you are currently iterating.
$array | group | ...
Group-Object is a rather slow cmdlet, you better collect (or count) the items on-the-fly (where you do $array += $item.Split("/")[1]) using a hashtable, something like:
$Name = $item.Split("/")[1]
if (!$HashTable.Contains($Name)) { $HashTable[$Name] = [Collections.Generic.List[String]]::new() }
$HashTable[$Name].Add($Item)
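Putting those points together, a single pass could look roughly like the sketch below. It works on an in-memory copy of the demo lines, and the three patterns are hypothetical placeholders chosen so that some demo lines match; with a real file you would read lines from a StreamReader and write both lists out once at the end.

```powershell
# Demo lines from the question, held in memory for illustration.
$lines = @(
    'images/STRINGA/2XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg'
    'images/STRINGA/3XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg'
    'images/STRINGB/4XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg'
    'images/STRINGB/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg'
    'images/STRINGC/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg'
)

$counts    = @{}                                                 # ID -> number of matching lines
$matched   = [System.Collections.Generic.List[string]]::new()   # lines that met the conditions
$remaining = [System.Collections.Generic.List[string]]::new()   # lines kept for the next cycle

foreach ($item in $lines) {
    # Hypothetical condition set; replace with the real patterns.
    if (($item -like '*XXX*') -and ($item -like '*_Top_*') -and ($item -notlike '*STRINGC*')) {
        $id = $item.Split('/')[1]
        $counts[$id] = [int]$counts[$id] + 1    # a missing key casts to 0
        $matched.Add($item)
    }
    else {
        $remaining.Add($item)
    }
}

$counts
```

Nothing is copied or rewritten inside the loop, so the cost grows with the number of lines rather than with its square. The two lists can then be written with a single Set-Content call each.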
To minimize memory usage it may be better to read one line at a time and check whether it already exists. In the code below I used StringReader; you can replace it with StreamReader to read from a file. I'm checking whether the entire string exists, but you may want to split the line first. Notice that there are duplicates in the input but not in the dictionary. See the code below:
$rows= #"
images/STRINGA/2XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGA/3XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/4XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGC/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGA/2XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGA/3XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/4XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGC/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
"#
$dict = [System.Collections.Generic.Dictionary[int, System.Collections.Generic.List[string]]]::new();
$reader = [System.IO.StringReader]::new($rows)
while(($row = $reader.ReadLine()) -ne $null)
{
$hash = $row.GetHashCode()
if($dict.ContainsKey($hash))
{
#check if list contains the string
if($dict[$hash].Contains($row))
{
#string is a duplicate
}
else
{
#add string to dictionary value if it is not in list
$list = $dict[$hash]
$list.Add($row)
}
}
else
{
#add new hash value to dictionary
$list = [System.Collections.Generic.List[string]]::new();
$list.Add($row)
$dict.Add($hash, $list)
}
}
$dict
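A shorter variant of the same idea, swapping the hand-built dictionary of hash codes for a HashSet[string], is sketched below. This is a substitution on my part rather than what the answer above uses; the three input rows are made up for illustration. HashSet.Add returns $false when the value is already present, which replaces the manual ContainsKey and Contains checks.

```powershell
# Made-up input: three rows, one of them a duplicate.
$rows = @(
    'images/STRINGA/2XXXXXXXX_rTTTTw.jpg'
    'images/STRINGB/4XXXXXXXX_rTTTTw.jpg'
    'images/STRINGA/2XXXXXXXX_rTTTTw.jpg'
) -join "`n"

$seen   = [System.Collections.Generic.HashSet[string]]::new()
$unique = [System.Collections.Generic.List[string]]::new()
$reader = [System.IO.StringReader]::new($rows)

while ($null -ne ($row = $reader.ReadLine())) {
    # Add returns $true only the first time a given string is seen.
    if ($seen.Add($row)) {
        $unique.Add($row)
    }
}
$reader.Dispose()

$unique
```

Only the distinct lines are kept in memory, so the footprint depends on the number of unique rows rather than on the file size.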

How to speed up processing of ~million lines of text in log file

I am trying to parse a very large log file that consists of space delimited text across about 16 fields. Unfortunately the app logs a blank line in between each legitimate one (arbitrarily doubling the lines I must process). It also causes fields to shift because it uses space both as a delimiter and for empty fields. I couldn't get around this in LogParser. Fortunately PowerShell affords me the option to reference fields from the end as well, making it easier to get later fields affected by the shift.
After a bit of testing with smaller sample files, I've determined that processing line by line as the file is streaming with Get-Content natively is slower than just reading the file completely using Get-Content -ReadCount 0 and then processing from memory. This part is relatively fast (<1min).
The problem comes when processing each line, even though it's in memory. It is taking hours for a 75MB file with 561178 legitimate lines of data (minus all the blank lines).
I'm not doing much in the code itself. I'm doing the following:
Splitting line via space as delimiter
One of the fields is an IP address that I am reverse DNS resolving, which is obviously going to be slow. So I have wrapped this into more code to create an in-memory arraylist cache of previously resolved IPs and pulling from it when possible. The IPs are largely the same so after a few hundred lines, resolution shouldn't be an issue any longer.
Saving the needed array elements into my pscustomobject
Adding pscustomobject to arraylist to be used later.
During the loop I'm tracking how many lines I've processed and outputting that info in a progress bar (I know this adds extra time but not sure how much). I really want to know progress.
All in all, it's processing some 30-40 lines per second, but obviously this is not very fast.
Can someone offer alternative methods/objectTypes to accomplish my goals and speed this up tremendously?
Below are some samples of the log with the field shift (Note this is a Windows DNS Debug log) as well as the code below that.
10/31/2022 12:38:45 PM 2D00 PACKET 000000B25A583FE0 UDP Snd 127.0.0.1 6c94 R Q [8385 A DR NXDOMAIN] AAAA (4)pool(3)ntp(3)org(0)
10/31/2022 12:38:45 PM 2D00 PACKET 000000B25A582050 UDP Snd 127.0.0.1 3d9d R Q [8081 DR NOERROR] A (4)pool(3)ntp(3)org(0)
NOTE: the issue in this case being [8385 A DR NXDOMAIN] (4 fields) vs [8081 DR NOERROR] (3 fields)
Other examples would be the "R Q" where sometimes it's " Q".
$Logfile = "C:\Temp\log.txt"
[System.Collections.ArrayList]$LogEntries = @()
[System.Collections.ArrayList]$DNSCache = @()
# Initialize log iteration counter
$i = 1
# Get Log data. Read entire log into memory and save only lines that begin with a date (ignoring blank lines)
$LogData = Get-Content $Logfile -ReadCount 0 | % {$_ | ? {$_ -match "^\d+\/"}}
$LogDataTotalLines = $LogData.Length
# Process each log entry
$LogData | ForEach-Object {
$PercentComplete = [math]::Round(($i/$LogDataTotalLines * 100))
Write-Progress -Activity "Processing log file . . ." -Status "Processed $i of $LogDataTotalLines entries ($PercentComplete%)" -PercentComplete $PercentComplete
# Split line using space, including sequential spaces, as delimiter.
# NOTE: Due to how app logs events, some fields may be blank leading split yielding different number of columns. Fortunately the fields we desire
# are in static positions not affected by this, except for the last 2, which can be referenced backwards with -2 and -1.
$temp = $_ -Split '\s+'
# Resolve DNS name of IP address for later use and cache into arraylist to avoid DNS lookup for same IP as we loop through log
If ($DNSCache.IP -notcontains $temp[8]) {
$DNSEntry = [PSCustomObject]@{
IP = $temp[8]
DNSName = Resolve-DNSName $temp[8] -QuickTimeout -DNSOnly -ErrorAction SilentlyContinue | Select -ExpandProperty NameHost
}
# Add DNSEntry to DNSCache collection
$DNSCache.Add($DNSEntry) | Out-Null
# Set resolved DNS name to that which came back from Resolve-DNSName cmdlet. NOTE: value could be blank.
$ResolvedDNSName = $DNSEntry.DNSName
} Else {
# DNSCache contains resolved IP already. Find and Use it.
$ResolvedDNSName = ($DNSCache | ? {$_.IP -eq $temp[8]}).DNSName
}
$LogEntry = [PSCustomObject]@{
Datetime = $temp[0] + " " + $temp[1] + " " + $temp[2] # Combines first 3 fields Date, Time, AM/PM
ClientIP = $temp[8]
ClientDNSName = $ResolvedDNSName
QueryType = $temp[-2] # Second to last entry of array
QueryName = ($temp[-1] -Replace "\(\d+\)",".") -Replace "^\.","" # Last entry of array. Replace any "(#)" characters with period and remove first period for friendly name
}
# Add LogEntry to LogEntries collection
$LogEntries.Add($LogEntry) | Out-Null
$i++
}
Here is a more optimized version you can try.
What changed?:
Removed Write-Progress, especially because it's not known if Windows PowerShell is used. PowerShell versions below 6 have a big performance impact with Write-Progress
Changed $DNSCache to Generic Dictionary for fast lookups
Changed $LogEntries to Generic List
Switched from Get-Content to switch -Regex -File
$Logfile = 'C:\Temp\log.txt'
$LogEntries = [System.Collections.Generic.List[psobject]]::new()
$DNSCache = [System.Collections.Generic.Dictionary[string, psobject]]::new([System.StringComparer]::OrdinalIgnoreCase)
# Process each log entry
switch -Regex -File ($Logfile) {
'^\d+\/' {
# Split line using space, including sequential spaces, as delimiter.
# NOTE: Due to how app logs events, some fields may be blank leading split yielding different number of columns. Fortunately the fields we desire
# are in static positions not affected by this, except for the last 2, which can be referenced backwards with -2 and -1.
$temp = $_ -Split '\s+'
$ip = [string] $temp[8]
$resolvedDNSRecord = $DNSCache[$ip]
if ($null -eq $resolvedDNSRecord) {
$resolvedDNSRecord = [PSCustomObject]@{
IP = $ip
DNSName = Resolve-DnsName $ip -QuickTimeout -DnsOnly -ErrorAction Ignore | select -ExpandProperty NameHost
}
$DNSCache[$ip] = $resolvedDNSRecord
}
$LogEntry = [PSCustomObject]@{
Datetime = $temp[0] + ' ' + $temp[1] + ' ' + $temp[2] # Combines first 3 fields Date, Time, AM/PM
ClientIP = $ip
ClientDNSName = $resolvedDNSRecord.DNSName
QueryType = $temp[-2] # Second to last entry of array
QueryName = ($temp[-1] -Replace '\(\d+\)', '.') -Replace '^\.', '' # Last entry of array. Replace any "(#)" characters with period and remove first period for friendly name
}
# Add LogEntry to LogEntries collection
$LogEntries.Add($LogEntry)
}
}
If it's still slow, there is still the option to use Start-ThreadJob as a multithreading approach with chunked lines (like 10000 per job).
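To check that indexing from the end really sidesteps the field shift, the sketch below runs the same split and replace steps over the two sample lines from the question, without any DNS lookup. Only the FieldCount property is added for illustration.

```powershell
# The two sample lines from the question; the bracketed block has 4 tokens
# in the first line and 3 in the second.
$samples = @(
    '10/31/2022 12:38:45 PM 2D00 PACKET 000000B25A583FE0 UDP Snd 127.0.0.1 6c94 R Q [8385 A DR NXDOMAIN] AAAA (4)pool(3)ntp(3)org(0)'
    '10/31/2022 12:38:45 PM 2D00 PACKET 000000B25A582050 UDP Snd 127.0.0.1 3d9d R Q [8081 DR NOERROR] A (4)pool(3)ntp(3)org(0)'
)

$parsed = foreach ($line in $samples) {
    $temp = $line -split '\s+'
    [PSCustomObject]@{
        FieldCount = $temp.Count     # differs between the two lines
        ClientIP   = $temp[8]        # fixed position from the start
        QueryType  = $temp[-2]       # second to last, unaffected by the shift
        QueryName  = ($temp[-1] -replace '\(\d+\)', '.') -replace '^\.', ''
    }
}

$parsed | Format-Table -AutoSize
```

The two lines split into a different number of fields, yet ClientIP, QueryType and QueryName land in the right place for both. Note that the friendly name keeps its trailing period, since only the leading one is removed.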

Fast compare two large csv(boths rows and columns) in powershell

I have two large CSVs to compare. Both CSVs are basically data from the same system, 1 day apart. There are around 12k rows and 30 columns.
The aim is to identify which column data has changed for the primary key (#ID).
My idea was to loop through the CSVs to identify which rows have changed and dump these into a separate CSV. Once done, I loop through the changed rows again and identify the exact change in each column.
$NewCSV = Import-Csv -Path ".\Data_A.csv"
$OldCSV = Import-Csv -Path ".\Data_B.csv"
foreach ($LineNew in $NewCSV)
{
ForEach ($LineOld in $OldCSV)
{
If($LineNew -eq $LineOld)
{
Write-Host $LineNew, " Match"
}else{
Write-Host $LineNew, " Not Match"
}
}
}
But as soon as I run the loop, it takes forever for 12k rows. I was hoping there is a more efficient, quicker way to compare large files in PowerShell.
Well, you can give this a try. I'm not claiming it will be fast, given what vonPryz has already pointed out, but it should give you a good side-by-side perspective of what has changed from OldCsv to NewCsv.
Note: Those cells that have the same value on both CSVs will be ignored.
$NewCSV = Import-Csv -Path ".\Data_A.csv"
$OldCSV = Import-Csv -Path ".\Data_B.csv" | Group-Object ID -AsHashTable -AsString
$properties = $newCsv[0].PSObject.Properties.Name
$result = foreach($line in $NewCSV)
{
if($ref = $OldCSV[$line.ID])
{
foreach($prop in $properties)
{
if($line.$prop -ne $ref.$prop)
{
[pscustomobject]@{
ID = $line.ID
Property = $prop
OldValue = $ref.$prop
NewValue = $line.$prop
}
}
}
continue
}
Write-Warning "ID $($line.ID) could not be found on Old Csv!!"
}
As vonPryz hints in the comments, you've written an algorithm with quadratic time complexity (O(n²) in Big-O notation) - every time the input size doubles, the number of computations performed increase 4-fold.
To avoid this, I'd suggest using a hashtable or other dictionary type to hold each data set, and use the primary key from the input as the dictionary key. This way you get constant-time lookup of corresponding records, and the time complexity of your algorithm becomes near-linear (O(2n + k)):
$NewCSV = @{}
Import-Csv -Path ".\Data_A.csv" |ForEach-Object {
$NewCSV[$_.ID] = $_
}
$OldCSV = @{}
Import-Csv -Path ".\Data_B.csv" |ForEach-Object {
$OldCSV[$_.ID] = $_
}
Now that we can efficiently resolve each row by its ID, we can inspect the whole of the data sets with an independent loop over each:
foreach($entry in $NewCSV.GetEnumerator()){
if(-not $OldCSV.ContainsKey($entry.Key)){
# $entry.Value is a new row, not seen in the old data set
continue
}
$newRow = $entry.Value
$oldRow = $OldCSV[$entry.Key]
# do the individual comparison of the rows here
}
Run a second loop like the one above with the roles swapped, iterating over $OldCSV and checking $NewCSV.ContainsKey(...), to detect deletions.
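For a complete picture, here is a minimal end-to-end sketch of that approach. The inline rows and the Name and City columns are hypothetical stand-ins for Data_B.csv and Data_A.csv; only the ID column is taken from the question.

```powershell
# Hypothetical sample data standing in for the old and new CSV files.
$old = 'ID,Name,City', '1,Ann,Oslo', '2,Bob,Rome',  '3,Cid,Bonn' | ConvertFrom-Csv
$new = 'ID,Name,City', '1,Ann,Oslo', '2,Bob,Paris', '4,Dee,Kiev' | ConvertFrom-Csv

# Index both data sets by primary key for constant-time lookups.
$oldById = @{}
foreach ($row in $old) { $oldById[$row.ID] = $row }
$newById = @{}
foreach ($row in $new) { $newById[$row.ID] = $row }

$properties = $new[0].PSObject.Properties.Name

# Pass 1: additions and per-column modifications.
$changes = foreach ($id in $newById.Keys) {
    if (-not $oldById.ContainsKey($id)) {
        [PSCustomObject]@{ ID = $id; Change = 'Added'; Property = $null; OldValue = $null; NewValue = $null }
        continue
    }
    foreach ($prop in $properties) {
        if ($newById[$id].$prop -ne $oldById[$id].$prop) {
            [PSCustomObject]@{
                ID       = $id
                Change   = 'Modified'
                Property = $prop
                OldValue = $oldById[$id].$prop
                NewValue = $newById[$id].$prop
            }
        }
    }
}

# Pass 2: deletions (IDs present in the old set only).
$deleted = foreach ($id in $oldById.Keys) {
    if (-not $newById.ContainsKey($id)) { $id }
}

$changes | Format-Table -AutoSize
"Deleted IDs: $($deleted -join ', ')"
```

Each row is visited a fixed number of times, so the run time grows roughly linearly with the row count instead of quadratically. The order of the reported changes follows the hashtable's enumeration order, which is not guaranteed to match the file order.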

Run command, extract a field, run a resultant command

Apologies if this is an insanely simple question, but I'm at something of a loss.
What I'm trying to do is take a command output - in this case from NetApp DFM:
dfm event list
ID Source Name Severity Timestamp
------- ------- ------------- ----------- ------------
1 332 volume-online Normal 20 Apr 10:16
2 443 volume-online Normal 20 Apr 10:17
3 3222 volume-online Normal 20 Apr 10:18
I have about 17,000 events - I want to delete them all by ID, by running:
dfm event delete <ID>
I know exactly how I'd do this on Unix (and used to, when this was our platform):
for i in `dfm event list | awk '{print $1}'`
do
dfm event delete $i
done
For bonus points - a 'grep' type criteria? I apologise in advance for the basic nature of the question - I've tried looking on Google for a suitable example, but haven't found anything.
I've made a start by:
dfm event list > dfmevent.txt
foreach ( $line in get-content dfmevent.txt ) {
echo $line
}
But I thought I would ask if there's a better way.
I don't have access to your environment to test but if you are just trying to get access to that first element which is the ID then that should be straight forward.
dfm event list | ForEach-Object{$_.Split(" ",2)[0]} | Where-Object{$_ -match '^\d+$'} | ForEach-Object{
#For Testing
Write-Host "Id: $_ will be deleted"
# Then do something
# dfm event delete $_
}
I'm sure the output is already delimited with new line so sending to file might be redundant.
We take each line and try and split it on the first space. Then pass the first element from that array. Next we ensure that element is indeed a number with a simple regex check. This will ensure that we only get numbers. I had thought about skipping the first two lines but this should work for other occurrences of text as well.
The last loop is for processing that ID. I left a Write-Host there for testing. Assuming you get the id's you are looking for you should just be able to uncomment out that last line with dfm event delete $_
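For the "grep"-style bonus question, a filter on the whole line can go in front of the ID extraction. The sketch below works on a hard-coded copy of the sample output from the question, because dfm is not available here; in practice the $sample array would be replaced by the output of dfm event list, and the 'volume-online' pattern is only an example criterion.

```powershell
# Copy of the sample output from the question (stand-in for: dfm event list).
$sample = @(
    'ID      Source  Name          Severity    Timestamp'
    '------- ------- ------------- ----------- ------------'
    '1       332     volume-online Normal      20 Apr 10:16'
    '2       443     volume-online Normal      20 Apr 10:17'
    '3       3222    volume-online Normal      20 Apr 10:18'
)

$ids = $sample |
    Where-Object { $_ -match 'volume-online' } |      # grep-like filter on the whole line
    ForEach-Object {
        # Capture the leading number; header and separator lines do not match.
        if ($_ -match '^\s*(\d+)\s') { $Matches[1] }
    }

foreach ($id in $ids) {
    # For testing only; swap in the real call once the IDs look right:
    # dfm event delete $id
    Write-Host "Id: $id would be deleted"
}
```

Anchoring the regular expression at the start of the line also tolerates leading spaces, which a plain Split on the first space would turn into an empty first element.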
Capturing the output of a DOS command into Powershell is a challenge.
Using a native snapin or module from NetApp would be easier.
might be worth checking out if that link helps
Otherwise, your method of writing to a text file and reading it back in is actually quite a good idea, this is one way of reading it back and pushing the data into the command you need.
$a = get-content dfmevent.txt
foreach ($i in $a) { if ($i.ReadCount -gt 2) { dfm event delete ($i.Substring(0,$i.IndexOf(" "))) } }
This variant only collects the IDs in the variable $result, without deleting anything:
$a = get-content dfmevent.txt
$result = @()
foreach ($i in $a) { if ($i.ReadCount -gt 2) { $result += $i.Substring(0,$i.IndexOf(" "))} }
And if you did not want to write to a text file, you could use the .NET method of capturing the output directly
$ProcessInfo = New-Object System.Diagnostics.ProcessStartInfo
$ProcessInfo.FileName = "dfm"
$ProcessInfo.RedirectStandardOutput = $true
$ProcessInfo.UseShellExecute = $false
$ProcessInfo.Arguments = "event list"
$Process = New-Object System.Diagnostics.Process
$Process.StartInfo = $ProcessInfo
$Process.Start() | Out-Null
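# Read the output before waiting: waiting first can deadlock once the redirected output buffer fills up
$output = $Process.StandardOutput.ReadToEnd()
$Process.WaitForExit()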

Understanding performance impact of "write-output"

I'm writing a Powershell script (PS version 4) to parse and process IIS log files, and I've come across an issue I don't quite understand: write-output seems to add significant processing time to the script. The core of it is this (there is more, but this demonstrates the issue):
$file = $args[0]
$linecount = 0
$start = [DateTime]::Now
$reader = [IO.File]::OpenText($file)
while ($reader.Peek() -ge 0) {
$line = $reader.ReadLine()
$linecount++
if (0 -eq ($linecount % 10000)) {
$batch = [DateTime]::Now
[Console]::Error.WriteLine(" Processed $linecount lines ($file) in $(($batch - $start).TotalMilliseconds)ms")
$start = $batch
}
$parts = $line.split(' ')
$out = "$file,$($parts[0]) $($parts[1]),$($parts[2]),$($parts[3]),$($parts[4]),$($parts[5]),$($parts[6]),$($parts[7])"
## Send the output out - comment in/out the desired output method
## Direct output - roughly 10,000 lines / 880ms
$out
## Via write-output - roughly 10,000 lines / 1500ms
write-output $out
}
$reader.Close()
Invoked as .\script.ps1 {path_to_340,000_line_IIS_log} > bob.txt; progress/performance timings are given on stderr.
The script above has two output lines - the write-output one is reporting 10,000 lines every 1500ms, whereas the line that does not have write-output takes as good as half that, averaging about 880ms per 10,000 lines.
I thought that an object defaulted to write-output if it had no other thing (i.e., I thought that "bob" was equivalent to write-output "bob"), but the times I'm getting argue against this.
What am I missing here?
Just a guess, but:
Looking at the help on write-output
Write-Output [-InputObject] <PSObject[]> [-NoEnumerate] [<CommonParameters>]
You're giving it a list of objects as an argument, so it has to spend a little time assembling them into an array internally before it does the write, whereas simply outputting them streams them to the pipeline immediately. You could pipe them to Write-Output instead, but that adds another pipeline stage, which might be even worse.
Edit
In addition you'll find that it's adding .062ms per operation (1500 -880)/10000. You have to scale that to very large data sets before it becomes noticeable.
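If you want to reproduce the difference without an IIS log, a small timing sketch along these lines may help. The loop count and the "line $i" strings are arbitrary choices, and the absolute timings depend on the machine and the PowerShell version, so no particular result is implied.

```powershell
$n = 20000

# Implicit output: the string is emitted directly to the output stream.
$sw = [System.Diagnostics.Stopwatch]::StartNew()
$implicit = foreach ($i in 1..$n) { "line $i" }
$sw.Stop()
$implicitMs = $sw.Elapsed.TotalMilliseconds

# Explicit output: each iteration invokes the Write-Output cmdlet.
$sw.Restart()
$explicit = foreach ($i in 1..$n) { Write-Output "line $i" }
$sw.Stop()
$explicitMs = $sw.Elapsed.TotalMilliseconds

'Implicit output: {0:N0} ms; Write-Output: {1:N0} ms for {2} lines' -f $implicitMs, $explicitMs, $n
```

Both loops produce the same strings, so any gap between the two timings comes from the per-call overhead of invoking the cmdlet.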