Optimizing a script - PowerShell

Info
I've created a script which analyzes the debug logs from Windows DNS Server.
It does the following:
1. Open the debug log using the [System.IO.File] class
2. Perform a regex match on each line
3. Separate 16 capture groups into different properties inside a custom object
4. Fill dictionaries, appending to the value of each key to produce statistics
Steps 1 and 2 take the longest. In fact, they take a seemingly endless amount of time, because the file is growing as it is being read.
Problem
Due to the size of the debug log (80,000 KB, roughly 80 MB), it takes a very long time.
I believe that my code is fine for smaller text files, but it fails to deal with much larger files.
Code
Here is my code: https://github.com/cetanu/msDnsStats/blob/master/msdnsStats.ps1
Debug log preview
This is what the debug log looks like (including the blank lines).
Multiply this by about 100,000,000 and you have my debug log.
21/03/2014 2:20:03 PM 0D0C PACKET 0000000005FCB280 UDP Rcv 202.90.34.177 3709 Q [1001 D NOERROR] A (2)up(13)massrelevance(3)com(0)
21/03/2014 2:20:03 PM 0D0C PACKET 00000000042EB8B0 UDP Rcv 67.215.83.19 097f Q [0000 NOERROR] CNAME (15)manchesterunity(3)org(2)au(0)
21/03/2014 2:20:03 PM 0D0C PACKET 0000000003131170 UDP Rcv 62.36.4.166 a504 Q [0001 D NOERROR] A (3)ekt(4)user(7)net0319(3)com(0)
21/03/2014 2:20:03 PM 0D0C PACKET 00000000089F1FD0 UDP Rcv 80.10.201.71 3e08 Q [1000 NOERROR] A (4)dns1(5)offis(3)com(2)au(0)
Request
I need ways or ideas on how to open and read each line of a file more quickly than what I am doing now.
I am open to suggestions of using a different language.

I would trade this:
$dnslog = [System.IO.File]::Open("c:\dns.log","Open","Read","ReadWrite")
$dnslog_content = New-Object System.IO.StreamReader($dnslog)
For ($i=0; $i -lt $dnslog.length; $i++)
{
    $line = $dnslog_content.readline()
    if ($line -eq $null) { continue }
    # REGEX MATCH EACH LINE OF LOGFILE
    $pattern = $line | select-string -pattern $regex
    # IGNORE EMPTY MATCH
    if ($pattern -eq $null) {
        continue
    }
for this:
Get-Content 'c:\dns.log' -ReadCount 1000 |
    ForEach-Object {
        foreach ($line in $_)
        {
            if ($line -match $regex)
            {
                # Process matches
            }
        }
    }
That will reduce the number of file read operations by a factor of 1000.
Trading away the Select-String operation will require refactoring the rest of the code to work with $matches[n] instead of $pattern.matches[0].groups[$n].value, but it is much faster. Select-String returns MatchInfo objects which contain a lot of additional information about the match (line number, filename, etc.), which is great if you need it. If all you need is the strings from the captures, that's wasted effort.
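For illustration, here is a minimal sketch of that refactor. The two named groups below are stand-ins for the full 16-group regex (which isn't reproduced here); after a successful -match, the automatic $matches hashtable exposes the whole match as $matches[0] and each capture by number or name:
$regex = '^(?<date>\S+ \S+ \S+) .* (?<protocol>UDP|TCP) '

Get-Content 'c:\dns.log' -ReadCount 1000 | ForEach-Object {
    foreach ($line in $_) {
        if ($line -match $regex) {
            # $matches[0] is the whole match; captures are available by number or name
            $date     = $matches['date']
            $protocol = $matches['protocol']
        }
    }
}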
You're creating an object ($log), and then accumulating values into array properties:
$log.date += @($pattern.matches[0].groups[$n].value); $n++
That array addition is going to kill your performance. Also, hash table operations are faster than object property updates.
I'd create $log as a hash table first, and the key values as array lists:
$log = @{}
$log.date = New-Object collections.arraylist
Then inside your loop:
$log.date.add($matches[1]) > $null
Then create your object from $log after you've populated all of the array lists.
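Putting those pieces together, a compact sketch of the pattern might look like the following; the field names are illustrative (the real script has 16 capture groups):
$log = @{}
foreach ($field in 'date','time','protocol','client') {
    $log[$field] = New-Object System.Collections.ArrayList
}

# inside the line-processing loop, after a successful -match:
#   $null = $log.date.Add($matches[1])
#   $null = $log.client.Add($matches[4])

# once all lines are processed, build the object in one go:
$logObject = New-Object PSObject -Property $log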

As a general piece of advice, use Measure-Command to find out which script blocks take the longest.
That being said, the sleep looks a bit odd. If I'm not mistaken, you sleep 20 ms after each row:
sleep -milliseconds 20
Multiply 20 ms by the size of the log, roughly 100 million lines, and you get quite a long total sleep time. Try sleeping only after a decent-sized batch instead; see whether 10,000 rows works, like so:
if ($i % 10000 -eq 0) {
    write-host -nonewline "."
    start-sleep -milliseconds 20
}
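For context, a sketch of how that batched sleep might sit inside the read loop; the $i counter and the 10,000-line batch size are illustrative, not taken from the original script:
$i = 0
Get-Content 'c:\dns.log' -ReadCount 1000 | ForEach-Object {
    foreach ($line in $_) {
        # ... regex matching and statistics gathering here ...
        $i++
        if ($i % 10000 -eq 0) {
            write-host -nonewline "."
            start-sleep -milliseconds 20
        }
    }
}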

Related

How to speed up processing of ~million lines of text in log file

I am trying to parse a very large log file that consists of space-delimited text across about 16 fields. Unfortunately the app logs a blank line in between each legitimate one (arbitrarily doubling the lines I must process). It also causes fields to shift, because it uses a space both as a delimiter and for empty fields. I couldn't get around this in LogParser. Fortunately PowerShell lets me reference fields from the end as well, making it easier to get the later fields affected by the shift.
After a bit of testing with smaller sample files, I've determined that processing line by line as the file is streaming with Get-Content natively is slower than just reading the file completely using Get-Content -ReadCount 0 and then processing from memory. This part is relatively fast (<1min).
The problem comes when processing each line, even though it's in memory. It is taking hours for a 75MB file with 561178 legitimate lines of data (minus all the blank lines).
I'm not doing much in the code itself. I'm doing the following:
Splitting line via space as delimiter
One of the fields is an IP address that I am reverse DNS resolving, which is obviously going to be slow. So I have wrapped this into more code to create an in-memory arraylist cache of previously resolved IPs and pulling from it when possible. The IPs are largely the same so after a few hundred lines, resolution shouldn't be an issue any longer.
Saving the needed array elements into my pscustomobject
Adding pscustomobject to arraylist to be used later.
During the loop I'm tracking how many lines I've processed and outputting that info in a progress bar (I know this adds extra time but not sure how much). I really want to know progress.
All in all, it's processing some 30-40 lines per second, but obviously this is not very fast.
Can someone offer alternative methods/objectTypes to accomplish my goals and speed this up tremendously?
Below are some samples of the log with the field shift (Note this is a Windows DNS Debug log) as well as the code below that.
10/31/2022 12:38:45 PM 2D00 PACKET 000000B25A583FE0 UDP Snd 127.0.0.1 6c94 R Q [8385 A DR NXDOMAIN] AAAA (4)pool(3)ntp(3)org(0)
10/31/2022 12:38:45 PM 2D00 PACKET 000000B25A582050 UDP Snd 127.0.0.1 3d9d R Q [8081 DR NOERROR] A (4)pool(3)ntp(3)org(0)
NOTE: the issue in this case being [8385 A DR NXDOMAIN] (4 fields) vs [8081 DR NOERROR] (3 fields)
Other examples would be the "R Q" where sometimes it's " Q".
$Logfile = "C:\Temp\log.txt"
[System.Collections.ArrayList]$LogEntries = @()
[System.Collections.ArrayList]$DNSCache = @()
# Initialize log iteration counter
$i = 1
# Get Log data. Read entire log into memory and save only lines that begin with a date (ignoring blank lines)
$LogData = Get-Content $Logfile -ReadCount 0 | % {$_ | ? {$_ -match "^\d+\/"}}
$LogDataTotalLines = $LogData.Length
# Process each log entry
$LogData | ForEach-Object {
    $PercentComplete = [math]::Round(($i/$LogDataTotalLines * 100))
    Write-Progress -Activity "Processing log file . . ." -Status "Processed $i of $LogDataTotalLines entries ($PercentComplete%)" -PercentComplete $PercentComplete

    # Split line using space, including sequential spaces, as delimiter.
    # NOTE: Due to how app logs events, some fields may be blank leading split yielding different number of columns. Fortunately the fields we desire
    # are in static positions not affected by this, except for the last 2, which can be referenced backwards with -2 and -1.
    $temp = $_ -Split '\s+'

    # Resolve DNS name of IP address for later use and cache into arraylist to avoid DNS lookup for same IP as we loop through log
    If ($DNSCache.IP -notcontains $temp[8]) {
        $DNSEntry = [PSCustomObject]@{
            IP      = $temp[8]
            DNSName = Resolve-DNSName $temp[8] -QuickTimeout -DNSOnly -ErrorAction SilentlyContinue | Select -ExpandProperty NameHost
        }
        # Add DNSEntry to DNSCache collection
        $DNSCache.Add($DNSEntry) | Out-Null
        # Set resolved DNS name to that which came back from Resolve-DNSName cmdlet. NOTE: value could be blank.
        $ResolvedDNSName = $DNSEntry.DNSName
    } Else {
        # DNSCache contains resolved IP already. Find and Use it.
        $ResolvedDNSName = ($DNSCache | ? {$_.IP -eq $temp[8]}).DNSName
    }

    $LogEntry = [PSCustomObject]@{
        Datetime      = $temp[0] + " " + $temp[1] + " " + $temp[2] # Combines first 3 fields Date, Time, AM/PM
        ClientIP      = $temp[8]
        ClientDNSName = $ResolvedDNSName
        QueryType     = $temp[-2] # Second to last entry of array
        QueryName     = ($temp[-1] -Replace "\(\d+\)",".") -Replace "^\.","" # Last entry of array. Replace any "(#)" characters with period and remove first period for friendly name
    }
    # Add LogEntry to LogEntries collection
    $LogEntries.Add($LogEntry) | Out-Null
    $i++
}
Here is a more optimized version you can try.
What changed?:
Removed Write-Progress, especially because it's not known whether Windows PowerShell is used; Write-Progress has a big performance impact on PowerShell versions below 6
Changed $DNSCache to Generic Dictionary for fast lookups
Changed $LogEntries to Generic List
Switched from Get-Content to switch -Regex -File
$Logfile = 'C:\Temp\log.txt'
$LogEntries = [System.Collections.Generic.List[psobject]]::new()
$DNSCache = [System.Collections.Generic.Dictionary[string, psobject]]::new([System.StringComparer]::OrdinalIgnoreCase)
# Process each log entry
switch -Regex -File ($Logfile) {
    '^\d+\/' {
        # Split line using space, including sequential spaces, as delimiter.
        # NOTE: Due to how app logs events, some fields may be blank leading split yielding different number of columns. Fortunately the fields we desire
        # are in static positions not affected by this, except for the last 2, which can be referenced backwards with -2 and -1.
        $temp = $_ -Split '\s+'

        $ip = [string] $temp[8]
        $resolvedDNSRecord = $DNSCache[$ip]
        if ($null -eq $resolvedDNSRecord) {
            $resolvedDNSRecord = [PSCustomObject]@{
                IP      = $ip
                DNSName = Resolve-DnsName $ip -QuickTimeout -DnsOnly -ErrorAction Ignore | select -ExpandProperty NameHost
            }
            $DNSCache[$ip] = $resolvedDNSRecord
        }

        $LogEntry = [PSCustomObject]@{
            Datetime      = $temp[0] + ' ' + $temp[1] + ' ' + $temp[2] # Combines first 3 fields Date, Time, AM/PM
            ClientIP      = $ip
            ClientDNSName = $resolvedDNSRecord.DNSName
            QueryType     = $temp[-2] # Second to last entry of array
            QueryName     = ($temp[-1] -Replace '\(\d+\)', '.') -Replace '^\.', '' # Last entry of array. Replace any "(#)" characters with period and remove first period for friendly name
        }
        # Add LogEntry to LogEntries collection
        $LogEntries.Add($LogEntry)
    }
}
If it's still slow, there is still the option to use Start-ThreadJob as a multithreading approach with chunked lines (like 10000 per job).
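A rough sketch of that chunked Start-ThreadJob idea might look like the following. The chunk size, the $processChunk script block, and the omission of the DNS cache (sharing it safely across threads is more involved) are all assumptions on my part, not part of the original answer; the ThreadJob module must be available on Windows PowerShell.
$Logfile   = 'C:\Temp\log.txt'
$chunkSize = 10000
$jobs      = [System.Collections.Generic.List[object]]::new()
$chunk     = [System.Collections.Generic.List[string]]::new()

$processChunk = {
    param([string[]] $lines)
    foreach ($l in $lines) {
        $temp = $l -split '\s+'
        [PSCustomObject]@{
            Datetime  = $temp[0..2] -join ' '
            ClientIP  = $temp[8]
            QueryType = $temp[-2]
            QueryName = ($temp[-1] -replace '\(\d+\)', '.') -replace '^\.', ''
        }
    }
}

foreach ($line in [System.IO.File]::ReadLines($Logfile)) {
    if ($line -notmatch '^\d+\/') { continue }   # skip blank/non-data lines
    $chunk.Add($line)
    if ($chunk.Count -ge $chunkSize) {
        # hand the chunk to a background thread and keep reading
        $jobs.Add((Start-ThreadJob -ScriptBlock $processChunk -ArgumentList (,$chunk.ToArray())))
        $chunk.Clear()
    }
}
if ($chunk.Count -gt 0) {
    # final partial chunk
    $jobs.Add((Start-ThreadJob -ScriptBlock $processChunk -ArgumentList (,$chunk.ToArray())))
}

$LogEntries = $jobs | Receive-Job -Wait -AutoRemoveJob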

Use Get-Content or Import-Csv to read 1st column in 2nd line in a CSV

So I have a csv file which is 25MB.
I only need to get the value stored in the 2nd line, first column, and use it later in a PowerShell script.
e.g data
File_name,INVNUM,ID,XXX....850 columns
ABCD,123,090,xxxx.....850 columns
ABCD,120,091,xxxx.....850 columns
xxxxxx5000+ rows
So my first column data is always the same and I just need to get this filename from the first column, 2nd row.
Should I try to use Get-Content or Import-Csv for this use case?
Thanks,
Mickey
TessellatingHeckler's helpful answer contains a pragmatic, easy-to-understand solution that is most likely fast enough in practice; the same goes for Robert Cotterman's helpful answer which is concise (and also faster).
If performance is really paramount, you can try the following, which uses the .NET framework directly to read the lines - but given that you only need to read 2 lines, it's probably not worth it:
$inputFile = "$PWD/some.csv" # be sure to specify a *full* path
$isFirstLine=$true
$fname = foreach ($line in [IO.File]::ReadLines($inputFile)) {
    if ($isFirstLine) { $isFirstLine = $false; continue } # skip header line
    $line -replace '^([^,]*),.*', '$1' # extract 1st field from 2nd line and exit
    break # exit
}
Note: A conceptually simpler way to extract the 1st field is to use ($line -split ',')[0], but with a large number of columns the above -replace-based approach is measurably faster.
Update: TessellatingHeckler offers 2 ways to speed up the above:
Use of $line.Substring(0, $line.IndexOf(',')) in lieu of $line -replace '^([^,]*),.*', '$1' in order to avoid relatively costly regex processing.
To lesser gain, use of a [System.IO.StreamReader] instance's .ReadLine() method twice in a row rather than [IO.File]::ReadLines() in a loop.
Here's a performance comparison of the approaches across all answers on this page (as of this writing).
To run it yourself, you must download functions New-CsvSampleData and Time-Command first.
For more representative results, the timings are averaged across 1,000 runs:
# Create sample CSV file 'test.csv' with 850 columns and 100 rows.
$testFileName = "test-$PID.csv"
New-CsvSampleData -Columns 850 -Count 100 | Set-Content $testFileName
# Compare the execution speed of the various approaches:
Time-Command -Count 1000 {
    # Import-Csv
    Import-Csv -LiteralPath $testFileName |
        Select-Object -Skip 1 -First 1 -ExpandProperty 'col1'
}, {
    # ReadLines(), -replace
    $inputFile = $PWD.ProviderPath + "/$testFileName"
    $isFirstLine = $true
    foreach ($line in [IO.File]::ReadLines($inputFile)) {
        if ($isFirstLine) { $isFirstLine = $false; continue } # skip header line
        $line -replace '^([^,]*),.*', '$1' # extract 1st field from 2nd line and exit
        break # exit
    }
}, {
    # ReadLines(), .Substring / IndexOf
    $inputFile = $PWD.ProviderPath + "/$testFileName"
    $isFirstLine = $true
    foreach ($line in [IO.File]::ReadLines($inputFile)) {
        if ($isFirstLine) { $isFirstLine = $false; continue } # skip header line
        $line.Substring(0, $line.IndexOf(',')) # extract 1st field from 2nd line and exit
        break # exit
    }
}, {
    # ReadLine() x 2, .Substring / IndexOf
    $inputFile = $PWD.ProviderPath + "/$testFileName"
    $f = [System.IO.StreamReader]::new($inputFile, $true)
    $null = $f.ReadLine(); $line = $f.ReadLine()
    $line.Substring(0, $line.IndexOf(','))
    $f.Close()
}, {
    # Get-Content -Head, .Split()
    ((Get-Content $testFileName -Head 2)[1]).split(',')[1]
} |
    Format-Table Factor, Timespan, Command

Remove-Item $testFileName
Sample output from a single-core Windows 10 VM running Windows PowerShell v5.1 / PowerShell Core 6.1.0-preview.4 on a recent-model MacBook Pro:
Windows PowerShell v5.1:
Factor TimeSpan Command
------ -------- -------
1.00 00:00:00.0001922 # ReadLine() x 2, .Substring / IndexOf...
1.04 00:00:00.0002004 # ReadLines(), .Substring / IndexOf...
1.57 00:00:00.0003024 # ReadLines(), -replace...
3.25 00:00:00.0006245 # Get-Content -Head, .Split()...
25.83 00:00:00.0049661 # Import-Csv...
PowerShell Core 6.1.0-preview.4:
Factor TimeSpan Command
------ -------- -------
1.00 00:00:00.0001858 # ReadLine() x 2, .Substring / IndexOf...
1.03 00:00:00.0001911 # ReadLines(), .Substring / IndexOf...
1.60 00:00:00.0002977 # ReadLines(), -replace...
3.30 00:00:00.0006132 # Get-Content -Head, .Split()...
27.54 00:00:00.0051174 # Import-Csv...
Conclusions:
Calling .ReadLine() twice is marginally faster than the ::ReadLines() loop.
Using -replace instead of Substring() / IndexOf() adds about 60% execution time.
Using Get-Content is more than 3 times slower.
Using Import-Csv | Select-Object is close to 30 times(!) slower, presumably due to the large number of columns; that said, in absolute terms we're still only talking about around 5 milliseconds.
As a side note: execution on macOS seems to be noticeably slower overall, with the regex solution and the cmdlet calls also being slower in relative terms.
Depends what you want to prioritize.
$data = Import-Csv -LiteralPath 'c:\temp\data.csv' |
Select-Object -Skip 1 -First 1 -ExpandProperty 'File_Name'
Is short and convenient. (2nd line meaning 2nd line of the file, or 2nd line of the data? Don't skip any if it's the first line of data).
Select-Object with something like -First 1 will break the whole pipeline when it's done, so it won't wait to read the rest of the 25MB in the background before returning.
You could likely speed it up, or reduce memory use, a miniscule amount if you opened the file, seek'd two newlines, then a comma, then read to another comma, or some other long detailed code, but I very much doubt it would be worth it.
The same goes for Get-Content: the way it adds NoteProperties to the output strings means it's likely no easier on memory and not usefully faster than Import-Csv.
You could really shorten it with
(gc c:\file.txt -head 2)[1]
Only reads 2 lines and then grabs index 1 (second line)
You could then split it. And grab index 1 of the split up line
((gc c:\file.txt -head 2)[1]).split(',')[1]
UPDATE: After seeing the new post with times, I was inspired to do some tests myself (thanks mklement0). This was the fastest I could get to work:
$check = 0
foreach ($i in [IO.File]::ReadLines("$filePath")) {
    if ($check -eq 2) { break }
    if ($check -eq 1) { $value = $i.split(',')[1] } # $value = your answer
    $check++
}
Just thought of this: remove the if ... -eq 2 check and put break after a semicolon once the -eq 1 branch has run. 5 ticks faster. Haven't tested.
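That untested variant would presumably look something like this (a sketch, not benchmarked):
$check = 0
foreach ($i in [IO.File]::ReadLines("$filePath")) {
    # on the second line, grab the value and leave the loop immediately
    if ($check -eq 1) { $value = $i.split(',')[1]; break } # $value = your answer
    $check++
}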
Here were my results over 40,000 tests:
GC split avg was 1.11307622 Milliseconds
GC split Min was 0.3076 Milliseconds
GC split Max was 18.1514 Milliseconds
ReadLines split avg was 0.3836625825 Milliseconds
ReadLines split Min was 0.2309 Milliseconds
ReadLines split Max was 31.7407 Milliseconds
Stream Reader avg was 0.4464924825 Milliseconds
Stream Reader MIN was 0.2703 Milliseconds
Stream Reader Max was 31.4991 Milliseconds
Import-CSV avg was 1.32440485 Milliseconds
Import-CSV MIN was 0.2875 Milliseconds
Import-CSV Max was 103.1694 Milliseconds
I was able to run 3,000 tests a second on the 2nd and 3rd, and 1,000 tests a second on the first and last. StreamReader was his fastest one. And Import-Csv wasn't bad; I wonder whether mklement0's test CSV had a column named "file_name"? Anyhow, I'd personally use the GC command because it's concise and easy to remember. But this is up to you, and I wish you luck on your scripting adventures.
I'm certain we could start hyperthreading this and get insane results, but when you're talking thousandths of a second is it really a big deal? Especially to get one variable? :D
Here's the StreamReader code I used, for transparency:
$inputFile = "$filePath"
$f = [System.IO.StreamReader]::new($inputFile,$true);
$null = $f.ReadLine(); $line = $f.ReadLine()
$line.Substring(0, $line.IndexOf(','))
$f.Close()
I also noticed this pulls the 1st value of the second line, and I have no idea how to switch it to the 2nd value... the second argument to Substring is a length, here measured from position 0 to the first comma; if you change the start from 0 to, say, 5, it still uses that same length but starts grabbing at the 6th character.
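For what it's worth, one way to point the same technique at the second field (an assumption on my part, not part of the original answer) is to split, or to take the substring between the first and second commas:
# Option 1: split on commas and take index 1 (simplest)
$second = $line.Split(',')[1]

# Option 2: stick with IndexOf/Substring to avoid splitting the whole line
$firstComma  = $line.IndexOf(',')
$secondComma = $line.IndexOf(',', $firstComma + 1)
$second      = $line.Substring($firstComma + 1, $secondComma - $firstComma - 1)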
The Import-Csv I used was:
$data = Import-Csv -LiteralPath "$filePath" |
Select-Object -Skip 1 -First 1 -ExpandProperty 'FileName'
I tested these on a 90 MB CSV with 21 columns and 284k rows, and "FileName" was the second column.

Match select-string of > 11 characters and also starting after a certain point in file

I would like to search through a file (std_serverX.out) for a value of string cpu= that is 11 characters or greater. This file can contain anywhere up to or exceeding 1 Million lines.
To restrict the search further, I would like the search for cpu= to start after the first occurrence of the string Java Thread Dump has been found.
In my source file, the string Java Thread Dump does not begin until approximately line 1013169 of a file 1057465 lines long, so roughly 96% of what precedes Java Thread Dump is unnecessary.
Here is a section of the file that I would like to search:
cpu=191362359.38 [reset 191362359.38] ms elapsed=1288865.05 [reset 1288865.05] s allocated=86688238148864 B (78.84 TB) [reset 86688238148864 B (78.84 TB)] defined_classes=468
io= file i/o: 588014/275091 B, net i/o: 36449/41265 B, files opened:19, socks opened:0 [reset file i/o: 588014/275091 B, net i/o: 36449/41265 B, files opened:19, socks opened:0 ]
user="Guest" application="JavaEE/ResetPassword" tid=0x0000000047a8b000 nid=0x1b10 / 6928 runnable [_thread_blocked (_call_back), stack(0x0000000070de0000,0x0000000070fe0000)] [0x0000000070fdd000] java.lang.Thread.State: RUNNABLE
Above, you can see that cpu=191362359.38 is 12 characters long (including full stop and 2 decimal places). How do I match it so that values of cpu= smaller than 11 characters are ignored and not printed to file?
Here is what I have so far:
Get-Content -Path .\std_server*.out | Select-String '(cpu=)' | out-File -width 1024 .\output.txt
I have stripped my command down to its absolute basics so I do not get confused by other search requirements.
Also, I want this command to be as basic as possible that it can be run in one command-line in Powershell, if possible. So no advanced scripts or defined variables, if we can avoid it... :)
This is related to a previous message I opened which got complicated by my not defining precisely my requirements.
Thanks in advance for your help.
Antóin
A regex to look for cpu= followed by 9 digits, a literal ".", and 1 or more digits (i.e. a value of 11+ characters), all in one pipeline:
Get-Content -Path .\std_server*.out |
Select-String -Pattern 'cpu=\d{9}\.\d+' -AllMatches |
Select-Object -ExpandProperty matches |
Select-Object -ExpandProperty value
It can certainly be done, but piping a million lines, the first 96% of which you know have no relevance, is not going to be very fast or efficient.
A faster approach would be to use a StreamReader and just skip over the lines until the Java Thread Dump string is found:
$CPULines = @()
foreach ($file in Get-Item .\std_server*.out)
{
    # Create stream reader from file
    $Reader = New-Object -TypeName 'System.IO.StreamReader' -ArgumentList $file.FullName
    $JTDFound = $false

    # Read file line by line (compare against $null so blank lines don't end the loop early)
    while ($null -ne ($line = $Reader.ReadLine()))
    {
        # Keep looking until 'Java Thread Dump' is found
        if (-not $JTDFound)
        {
            $JTDFound = $line.Contains('Java Thread Dump')
        }
        else
        {
            # Then, if a value matching your description is found, add that line to our results
            if ($line -match '^cpu=([\d\.]{11,})\s')
            {
                $CPULines += $line
            }
        }
    }
    # Dispose of the stream reader
    $Reader.Dispose()
}

# Write output to file
$CPULines | Out-File .\output.txt

Understanding performance impact of "write-output"

I'm writing a Powershell script (PS version 4) to parse and process IIS log files, and I've come across an issue I don't quite understand: write-output seems to add significant processing time to the script. The core of it is this (there is more, but this demonstrates the issue):
$file = $args[0]
$linecount = 0
$start = [DateTime]::Now

$reader = [IO.File]::OpenText($file)
while ($reader.Peek() -ge 0) {
    $line = $reader.ReadLine()
    $linecount++
    if (0 -eq ($linecount % 10000)) {
        $batch = [DateTime]::Now
        [Console]::Error.WriteLine(" Processed $linecount lines ($file) in $(($batch - $start).TotalMilliseconds)ms")
        $start = $batch
    }
    $parts = $line.split(' ')
    $out = "$file,$($parts[0]) $($parts[1]),$($parts[2]),$($parts[3]),$($parts[4]),$($parts[5]),$($parts[6]),$($parts[7])"

    ## Send the output out - comment in/out the desired output method
    ## Direct output - roughly 10,000 lines / 880ms
    $out
    ## Via write-output - roughly 10,000 lines / 1500ms
    write-output $out
}
$reader.Close()
Invoked as .\script.ps1 {path_to_340,000_line_IIS_log} > bob.txt; progress/performance timings are given on stderr.
The script above has two output lines - the write-output one is reporting 10,000 lines every 1500ms, whereas the line that does not have write-output takes as good as half that, averaging about 880ms per 10,000 lines.
I thought that output defaulted to Write-Output if nothing else consumed it (i.e., I thought that "bob" was equivalent to Write-Output "bob"), but the times I'm getting argue against this.
What am I missing here?
Just a guess, but:
Looking at the help on write-output
Write-Output [-InputObject] <PSObject[]> [-NoEnumerate] [<CommonParameters>]
You're giving it a list of objects as an argument, so it has to spend a little time assembling them into an array internally before it does the write, whereas simply outputting them streams them to the pipeline immediately. You could pipe them to Write-Output instead, but that adds another pipeline stage, which might be even worse.
Edit
In addition, you'll find that it's adding about 0.062 ms per operation ((1500 - 880) / 10000). You have to scale that to very large data sets before it becomes noticeable.
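A quick way to see the difference for yourself is to time both forms over the same synthetic workload; this is a standalone sketch, not the IIS script, and the absolute numbers will vary by machine:
# Implicit output: the string goes straight to the pipeline
(Measure-Command {
    1..100000 | ForEach-Object { "line $_" } | Out-Null
}).TotalMilliseconds

# Explicit Write-Output: adds a cmdlet invocation per line
(Measure-Command {
    1..100000 | ForEach-Object { Write-Output "line $_" } | Out-Null
}).TotalMilliseconds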

Powershell 2 and .NET: Optimize for extremely large hash tables?

I am dabbling in Powershell and completely new to .NET.
I am running a PS script that starts with an empty hash table. The hash table will grow to at least 15,000 to 20,000 entries. Keys of the hash table will be email addresses in string form, and values will be booleans. (I simply need to track whether or not I've seen an email address.)
So far, I've been growing the hash table one entry at a time. I check to make sure the key-value pair doesn't already exist (PS will error on this condition), then I add the pair.
Here's the portion of my code we're talking about:
...
if ($ALL_AD_CONTACTS[$emailString] -ne $true) {
    $ALL_AD_CONTACTS += @{ $emailString = $true }
}
...
I am wondering if there is anything one can do from a PowerShell or .NET standpoint that will optimize the performance of this hash table if you KNOW it's going to be huge ahead of time, like 15,000 to 20,000 entries or beyond.
Thanks!
I performed some basic tests using Measure-Command, using a set of 20 000 random words.
The individual results are shown below, but in summary it appears that adding to one hashtable by first allocating a new hashtable with a single entry is incredibly inefficient :) Although there were some minor efficiency gains among options 2 through 5, in general they all performed about the same.
If I were to choose, I might lean toward option 5 for its simplicity (just a single Add call per string), but all the alternatives I tested seem viable.
$chars = [char[]]('a'[0]..'z'[0])
$words = 1..20KB | foreach {
    $count = Get-Random -Minimum 15 -Maximum 35
    -join (Get-Random $chars -Count $count)
}

# 1) Original, adding to hashtable with "+=".
#    TotalSeconds: ~800
Measure-Command {
    $h = @{}
    $words | foreach { if( $h[$_] -ne $true ) { $h += @{ $_ = $true } } }
}

# 2) Using sharding among sixteen hashtables.
#    TotalSeconds: ~3
Measure-Command {
    [hashtable[]]$hs = 1..16 | foreach { @{} }
    $words | foreach {
        $h = $hs[$_.GetHashCode() % 16]
        if( -not $h.ContainsKey( $_ ) ) { $h.Add( $_, $null ) }
    }
}

# 3) Using ContainsKey and Add on a single hashtable.
#    TotalSeconds: ~3
Measure-Command {
    $h = @{}
    $words | foreach { if( -not $h.ContainsKey( $_ ) ) { $h.Add( $_, $null ) } }
}

# 4) Using ContainsKey and Add on a hashtable constructed with capacity.
#    TotalSeconds: ~3
Measure-Command {
    $h = New-Object Collections.Hashtable( 21KB )
    $words | foreach { if( -not $h.ContainsKey( $_ ) ) { $h.Add( $_, $null ) } }
}

# 5) Using HashSet<string> and Add.
#    TotalSeconds: ~3
Measure-Command {
    $h = New-Object Collections.Generic.HashSet[string]
    $words | foreach { $null = $h.Add( $_ ) }
}
So it's a few weeks later, and I wasn't able to come up with the perfect solution. A friend at Google suggested splitting the hash into several smaller hashes. He suggested that each time I went to look up a key, I'd have several misses until I found the right "bucket", but he said the read penalty wouldn't be nearly as bad as the write penalty when the collision algorithm ran to insert entries into the (already giant) hash table.
I took this idea and took it one step further. I split the hash into 16 smaller buckets. When inserting an email address as a key into the data structures, I actually first compute a hash on the email address itself, and do a mod 16 operation to get a consistent value between 0 and 15. I then use that calculated value as the "bucket" number.
So instead of using one giant hash, I actually have a 16-element array, whose elements are hash tables of email addresses.
The total speed it takes to build the in-memory representation of my "master list" of 20,000+ email addresses, using the split-up hash table buckets, is now roughly 1,000% (10 times) faster.
Accessing all of the data in the hashes has no noticeable speed delays. This is the best solution I've been able to come up with so far. It's slightly ugly, but the performance improvement speaks for itself.
You're going to spend a lot of CPU time re-allocating the internal 'arrays' in the Hashtable. Have you tried the .NET constructor for Hashtable that takes a capacity?
$t = New-Object Hashtable 20000
...
if (!($t.ContainsKey($emailString))) {
    $t.Add($emailString, $emailString)
}
My version uses the same $emailString for the key and the value: no .NET boxing of $true to an [object] just as a placeholder. The non-null string will evaluate to $true in PowerShell 'if' conditionals, so other code where you check shouldn't change. Your use of '+= @{...}' would be a big no-no in performance-sensitive .NET code. You might be allocating a new Hashtable per email just by using the '@{}' syntax, which could be wasting a lot of time.
Your approach of breaking up the very large collection into a (relatively small) number of smaller collections is called 'sharding'. You should use the Hashtable constructor that takes a capacity even if you're sharding by 16.
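A sketch of sharding with pre-sized buckets; the bucket count, per-bucket capacity (~20,000 entries / 16 buckets) and the sample address are illustrative, not taken from the original script:
$bucketCount = 16
$buckets = 0..($bucketCount - 1) | ForEach-Object { New-Object Hashtable 2000 }

$emailString = 'someone@example.com'   # hypothetical input
# -band keeps the hash code non-negative so the modulo always lands in 0..15
$bucket = $buckets[($emailString.GetHashCode() -band 0x7FFFFFFF) % $bucketCount]
if (-not $bucket.ContainsKey($emailString)) {
    $bucket.Add($emailString, $emailString)
}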
Also, @Larold is right, if you're not looking up the email addresses, then use 'New-Object ArrayList 20000' to create a pre-allocated list.
Also, the collections grow exponentially (by a factor of 1.5 or 2 on each 'growth'). The effect of this is that you should be able to reduce how much you pre-allocate by an order of magnitude, and if the collections resize once or twice per 'data load' you probably won't notice. I would bet it is the first 10-20 generations of 'growth' that are taking the time.
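A back-of-the-envelope check of that 'generations of growth' point; the starting size of 16 slots and the doubling-on-resize policy are assumptions used purely for illustration:
$capacity = 16
$resizes  = 0
while ($capacity -lt 20000) {
    $capacity *= 2   # assume the table doubles on each growth
    $resizes++
}
"$resizes resizes to reach $capacity slots"   # 11 resizes to reach 32768 slots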