PowerShell 2 and .NET: Optimize for extremely large hash tables?

I am dabbling in PowerShell and completely new to .NET.
I am running a PS script that starts with an empty hash table. The hash table will grow to at least 15,000 to 20,000 entries. Keys of the hash table will be email addresses in string form, and values will be booleans. (I simply need to track whether or not I've seen an email address.)
So far, I've been growing the hash table one entry at a time. I check to make sure the key-value pair doesn't already exist (PS will error on this condition), then I add the pair.
Here's the portion of my code we're talking about:
...
if ($ALL_AD_CONTACTS[$emailString] -ne $true) {
$ALL_AD_CONTACTS += @{$emailString = $true}
}
...
I am wondering if there is anything one can do from a PowerShell or .NET standpoint that will optimize the performance of this hash table if you KNOW it's going to be huge ahead of time, like 15,000 to 20,000 entries or beyond.
Thanks!

I performed some basic tests using Measure-Command, with a set of 20,000 random words.
The individual results are shown below, but in summary it appears that adding to one hashtable by first allocating a new hashtable with a single entry is incredibly inefficient :) Although there were some minor efficiency gains among options 2 through 5, in general they all performed about the same.
If I were to choose, I might lean toward option 5 for its simplicity (just a single Add call per string), but all the alternatives I tested seem viable.
$chars = [char[]]('a'[0]..'z'[0])
$words = 1..20KB | foreach {
$count = Get-Random -Minimum 15 -Maximum 35
-join (Get-Random $chars -Count $count)
}
# 1) Original, adding to hashtable with "+=".
# TotalSeconds: ~800
Measure-Command {
$h = @{}
$words | foreach { if( $h[$_] -ne $true ) { $h += @{ $_ = $true } } }
}
# 2) Using sharding among sixteen hashtables.
# TotalSeconds: ~3
Measure-Command {
[hashtable[]]$hs = 1..16 | foreach { @{} }
$words | foreach {
$h = $hs[$_.GetHashCode() % 16]
if( -not $h.ContainsKey( $_ ) ) { $h.Add( $_, $null ) }
}
}
# 3) Using ContainsKey and Add on a single hashtable.
# TotalSeconds: ~3
Measure-Command {
$h = @{}
$words | foreach { if( -not $h.ContainsKey( $_ ) ) { $h.Add( $_, $null ) } }
}
# 4) Using ContainsKey and Add on a hashtable constructed with capacity.
# TotalSeconds: ~3
Measure-Command {
$h = New-Object Collections.Hashtable( 21KB )
$words | foreach { if( -not $h.ContainsKey( $_ ) ) { $h.Add( $_, $null ) } }
}
# 5) Using HashSet<string> and Add.
# TotalSeconds: ~3
Measure-Command {
$h = New-Object Collections.Generic.HashSet[string]
$words | foreach { $null = $h.Add( $_ ) }
}

So it's a few weeks later, and I wasn't able to come up with the perfect solution. A friend at Google suggested splitting the hash into several smaller hashes. He suggested that each time I went to look up a key, I'd have several misses until I found the right "bucket", but he said the read penalty wouldn't be nearly as bad as the write penalty when the collision algorithm ran to insert entries into the (already giant) hash table.
I took this idea and took it one step further. I split the hash into 16 smaller buckets. When inserting an email address as a key into the data structures, I actually first compute a hash on the email address itself, and do a mod 16 operation to get a consistent value between 0 and 15. I then use that calculated value as the "bucket" number.
So instead of using one giant hash, I actually have a 16-element array, whose elements are hash tables of email addresses.
The total time it takes to build the in-memory representation of my "master list" of 20,000+ email addresses, using split-up hash table buckets, is now roughly 10 times faster.
Accessing all of the data in the hashes has no noticeable speed delays. This is the best solution I've been able to come up with so far. It's slightly ugly, but the performance improvement speaks for itself.
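The bucket selection described above can be sketched like this (my reconstruction, not the original script; the email address is a hypothetical input). One caveat worth noting: String.GetHashCode() can return a negative number, so take the absolute value before the mod, or a negative index will silently index from the end of the array in PowerShell.

```powershell
# Sketch of the 16-bucket sharding described above (illustrative names).
[hashtable[]]$buckets = 1..16 | ForEach-Object { @{} }

$email = 'someone@example.com'               # hypothetical input
$i = [Math]::Abs($email.GetHashCode()) % 16  # consistent bucket number, 0..15
if (-not $buckets[$i].ContainsKey($email)) {
    $buckets[$i].Add($email, $true)
}
$buckets[$i].ContainsKey($email)   # True
```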

You're going to spend a lot of the CPU time re-allocating the internal 'arrays' in the Hashtable. Have you tried the .NET constructor for Hashtable that takes a capacity?
$t = New-Object Hashtable 20000
...
if (!($t.ContainsKey($emailString))) {
$t.Add($emailString, $emailString)
}
My version uses the same $emailString for both key and value, so there's no .NET boxing of $true to an [object] just to serve as a placeholder. A non-null string evaluates to $true in PowerShell 'if' conditionals, so other code where you check shouldn't need to change. Your use of '+= @{...}' would be a big no-no in performance-sensitive .NET code: you may be allocating a new Hashtable per email just by using the '@{}' syntax, which could be wasting a lot of time.
Your approach of breaking up the very large collection into a (relatively small) number of smaller collections is called 'sharding'. You should use the Hashtable constructor that takes a capacity even if you're sharding by 16.
Also, @Larold is right: if you're not looking up the email addresses, then use 'New-Object ArrayList 20000' to create a pre-allocated list.
Also, the collections grow exponentially (by a factor of 1.5 or 2 on each 'growth'). The effect of this is that you should be able to reduce how much you pre-allocate by an order of magnitude, and if the collections resize once or twice per 'data load' you probably won't notice. I would bet it is the first 10-20 generations of 'growth' that are taking the time.
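To put rough numbers on that last point, here is a quick sketch of my own (assuming capacity simply doubles on each growth; the real Hashtable grows to roughly-doubling prime sizes, so exact counts differ slightly):

```powershell
# How many growths a table needs to reach N entries, from a tiny default
# capacity versus a modest pre-allocation (doubling model, illustrative only).
function Get-GrowthCount([int]$initial, [int]$target) {
    $capacity = $initial
    $growths  = 0
    while ($capacity -lt $target) { $capacity *= 2; $growths++ }
    $growths
}
Get-GrowthCount 4    20000   # 13 growths from a tiny starting capacity
Get-GrowthCount 2048 20000   # only 4 growths after a modest pre-allocation
```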

Related

Check if a condition is met by a line within a TXT but "in an advanced way"

I have a TXT file with 1300 megabytes (huge thing). I want to build code that does two things:
Every line contains a unique ID at the beginning. I want to check for all lines with the same unique ID if the conditions is met for that "group" of IDs. (This answers me: For how many lines with the unique ID X have all conditions been met)
If the script is finished I want to remove all lines from the TXT where the condition was met (see 2). So I can rerun the script with another condition set to "narrow down" the whole document.
After few cycles I finally have a set of conditions that applies to all lines in the document.
It seems that my current approach is very slow (one cycle takes hours). My final result is a set of conditions that apply to all lines of the document.
If you find an easier way to do that, feel free to recommend.
Help is welcome :)
Code so far (does not fulfill everything from 1 & 2)
foreach ($item in $liste)
{
# Check Conditions
if ( ($item -like "*XXX*") -and ($item -like "*YYY*") -and ($item -notlike "*ZZZ*")) {
# Add a line to a document to see which lines match condition
Add-Content "C:\Desktop\it_seems_to_match.txt" "$item"
# Retrieve the unique ID from the line and feed array.
$array += $item.Split("/")[1]
# Remove the line from final document
$liste = $liste -replace $item, ""
}
}
# Pipe the "new cleaned" list somewhere
$liste | Set-Content -Path "C:\NewListToWorkWith.txt"
# Show me the counts
$array | group | % { $h = @{} } { $h[$_.Name] = $_.Count } { $h } | Out-File "C:\Desktop\count.txt"
Demo Lines:
images/STRINGA/2XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg images/STRINGA/3XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg images/STRINGB/4XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg images/STRINGB/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg images/STRINGC/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
performance considerations:
Add-Content "C:\Desktop\it_seems_to_match.txt" "$item"
try to avoid wrapping cmdlet pipelines
See also: Mastering the (steppable) pipeline
$array += $item.Split("/")[1]
Try to avoid using the increase assignment operator (+=) to create a collection
See also: Why should I avoid using the increase assignment operator (+=) to create a collection
$liste = $liste -replace $item, ""
This is a very expensive operation considering that you are reassigning (copying) a long list ($liste) with each iteration.
Besides it is a bad practice to change an array that you are currently iterating.
$array | group | ...
Group-Object is a rather slow cmdlet; you'd better collect (or count) the items on the fly (where you do $array += $item.Split("/")[1]) using a hashtable, something like:
$Name = $item.Split("/")[1]
if (!$HashTable.Contains($Name)) { $HashTable[$Name] = [Collections.Generic.List[String]]::new() }
$HashTable[$Name].Add($Item)
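Put together, that grouping-hashtable suggestion looks something like this minimal sketch (the input lines below are stand-ins for the real file content):

```powershell
# Minimal sketch of the on-the-fly grouping above; $lines stands in for the file.
$lines = 'images/STRINGA/2XX.jpg', 'images/STRINGA/3XX.jpg', 'images/STRINGB/4XX.jpg'
$HashTable = @{}
foreach ($item in $lines) {
    $Name = $item.Split('/')[1]
    if (!$HashTable.Contains($Name)) {
        # First time we see this ID: start a list for it
        $HashTable[$Name] = [Collections.Generic.List[string]]::new()
    }
    $HashTable[$Name].Add($item)
}
$HashTable['STRINGA'].Count   # 2
$HashTable['STRINGB'].Count   # 1
```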
To minimize memory usage it may be better to read one line at a time and check if it already exists. The code below uses a StringReader; you can replace it with a StreamReader to read from a file. I'm checking if the entire string exists, but you may want to split the line. Notice I have duplicates in the input but not in the dictionary. See the code below:
$rows = @"
images/STRINGA/2XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGA/3XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/4XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGC/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGA/2XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGA/3XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/4XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGC/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
"@
$dict = [System.Collections.Generic.Dictionary[int, System.Collections.Generic.List[string]]]::new();
$reader = [System.IO.StringReader]::new($rows)
while(($row = $reader.ReadLine()) -ne $null)
{
$hash = $row.GetHashCode()
if($dict.ContainsKey($hash))
{
#check if list contains the string
if($dict[$hash].Contains($row))
{
#string is a duplicate
}
else
{
#add string to dictionary value if it is not in list
$list = $dict[$hash]
$list.Add($row)
}
}
else
{
#add new hash value to dictionary
$list = [System.Collections.Generic.List[string]]::new();
$list.Add($row)
$dict.Add($hash, $list)
}
}
$dict
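A simpler variant of the same dedup idea (my sketch, not part of the answer above): a HashSet[string] handles the hashing and collision checks internally, so you don't need the manual hash-keyed dictionary of lists.

```powershell
# HashSet[string].Add returns $false for a duplicate, so it doubles as the check.
$seen = [System.Collections.Generic.HashSet[string]]::new()
$unique = foreach ($row in 'a/b.jpg', 'c/d.jpg', 'a/b.jpg') {
    if ($seen.Add($row)) { $row }   # keep only first occurrences
}
$unique.Count   # 2
```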

Enumerating large powershell object variable (1 million plus members)

I'm processing large amounts of data and after pulling the data and manipulating it, I have the results stored in memory in a variable.
I now need to separate this data into separate variables and this was easily done via piping and using a where-object, but this has slowed down now that I have much more data (1 million plus members). Note: it takes about 5+ minutes.
$DCEntries = $DNSQueries | ? {$_.ClientIP -in $DCs.ipv4address -Or $_.ClientIP -eq '127.0.0.1'}
$NonDCEntries = $DNSQueries | ? {$_.ClientIP -notin $DCs.ipv4address -And $_.ClientIP -ne '127.0.0.1'}
#Note:
#$DCs is an array of 60 objects of type Microsoft.ActiveDirectory.Management.ADDomainController, with two properties: Name, ipv4address
#$DNSQueries is a collection of pscustomobjects that has 6 properties, all strings.
I immediately realize I'm enumerating $DNSQueries (the large object) twice, which is obviously costing me some time. As such I decided to go about this a different way enumerating it once and using a Switch statement, but this seems to have exponentially caused the timing to INCREASE, which is not what I was going for.
$DNSQueries | ForEach-Object {
Switch ($_) {
{$_.ClientIP -in $DCs.ipv4address -Or $_.ClientIP -eq '127.0.0.1'} {
# Query is from a DC
$DCEntries += $_
}
default {
# Query is not from DC
$NonDCEntries += $_
}
}
}
I'm wondering if someone can explain to me why the second code takes so much more time. Further, perhaps offer a better way to accomplish what I want.
Is the Foreach-Object and/or appending of the sub variables costing that much time?
ForEach-Object is actually the slowest way to enumerate a collection, and the follow-up switch with a script-block condition adds even more overhead.
If the collection is already in memory, nothing can beat a foreach loop for linear enumeration.
As for your biggest problem: the use of += to add elements to an array, which is a fixed-size collection. PowerShell has to create a new array and copy all elements to it each time a new element is added; this causes an extremely high amount of overhead. See this answer as well as this awesome documentation for more details.
In this case you can combine a Collections.Generic.List<T> with PowerShell's explicit assignment.
$NonDCEntries = [Collections.Generic.List[object]]::new()
$DCEntries = foreach($item in $DNSQueries) {
if($item.ClientIP -in $DCs.IPv4Address -or $item.ClientIP -eq '127.0.0.1') {
$item
continue
}
$NonDCEntries.Add($item)
}
To put into perspective how exponentially bad += to an array is, you can test this code:
$Tests = [ordered]@{
'PowerShell Explicit Assignment' = {
$result = foreach($i in 1..$count) {
$i
}
}
'+= Operator to System.Array' = {
$result = @( )
foreach($i in 1..$count) {
$result += $i
}
}
'.Add(..) to List<T>' = {
$result = [Collections.Generic.List[int]]::new()
foreach($i in 1..$count) {
$result.Add($i)
}
}
}
foreach($count in 1000, 10000, 100000) {
foreach($test in $Tests.GetEnumerator()) {
$measurement = (Measure-Command { & $test.Value }).TotalMilliseconds
$totalRound = [math]::Round($measurement, 2).ToString() + ' ms'
[pscustomobject]@{
CollectionSize = $count
Test = $test.Key
TotalMilliseconds = $totalRound
}
}
}
Which in my laptop yields the following results:
CollectionSize Test                           TotalMilliseconds
-------------- ----                           -----------------
          1000 PowerShell Explicit Assignment 15.9 ms
          1000 += Operator to System.Array    26.88 ms
          1000 .Add(..) to List<T>            12.47 ms
         10000 PowerShell Explicit Assignment 1.07 ms
         10000 += Operator to System.Array    2488.24 ms
         10000 .Add(..) to List<T>            0.9 ms
        100000 PowerShell Explicit Assignment 16.07 ms
        100000 += Operator to System.Array    308931.8 ms
        100000 .Add(..) to List<T>            8.39 ms

Powershell script. More efficient way to perform nested foreach loops? [duplicate]

This question already has answers here:
In PowerShell, what's the best way to join two tables into one?
(5 answers)
Closed 5 months ago.
Good day.
I wrote a script that imports Excel files and then compares the rows. Each file contains about 13K rows. It is taking about 3 hours to process, which seems too long. This is happening because I am looping through every 13K rows from fileb for each row in filea.
Is there a more efficient way to do this?
Here is sample code:
#Import rows as customObject
$rowsa = Import-Excel $filea
$rowsb = Import-Excel $fileb
#Loop through each filea rows
foreach ($rowa in $rowsa)
{
#Loop through each fileb row. If the upc code matches rowa, check if other fields match
foreach ($rowb in $rowsb)
{
$rowb | Where-Object -Property "UPC Code" -Like $rowa.upc |
Foreach-Object {
if (( $rowa.uom2 -eq 'INP') -and ( $rowb.'Split Quantity' -ne $rowa.qty1in2 ))
{
#Do Something
}
}
}
}
Seems like you can leverage Group-Object -AsHashtable for this. See about Hash Tables for more info on why this should be faster.
$mapB = Import-Excel $fileb | Group-Object 'UPC Code' -AsHashTable -AsString
foreach($row in Import-Excel $filea) {
if($mapB.ContainsKey($row.upc)) {
$value = $mapB[$row.upc]  # note: this is a collection if more than one row shares a UPC Code
if($row.uom2 -eq 'INP' -and $row.qty1in2 -ne $value.'Split Quantity') {
$value # => has the row matching on UPC (FileA) / UPC Code (FileB)
$row # => current row in FileA
}
}
}
A few tricks:
The Object Pipeline may be easy, but it's not as fast as a statement
Try changing your code use to foreach statements, not Where-Object and Foreach-Object.
Use Hashtables to group.
While you can use Group-Object to do this, Group-Object suffers from the same performance problems as anything else in the pipeline.
Try to limit looping within looping.
As a general rule, looping within looping is O(n^2). If you can avoid loops within loops, great. Switching the code around to loop through A, then loop through B, will be more efficient; so will exiting your loops as early as possible.
Consider using a benchmarking tool
There's a little module I make called Benchpress that can help you test multiple approaches to see which is faster. The Benchpress docs have a number of general PowerShell performance benchmarks to help you determine the fastest way to script a given thing.
Updated Script Below:
#Import rows as customObject
$rowsa = Import-Excel $filea
$rowsb = Import-Excel $fileb
$rowAByUPC = @{}
foreach ($rowA in $rowsa) {
# This assumes there will only be one row per UPC.
# If there is more than one, you may need to make a list here instead
$rowAByUPC[$rowA.UPC] = $rowA
}
foreach ($rowB in $rowsB) {
# Skip any rows in B that don't have a UPC code.
$rowBUPC = $rowB."UPC Code"
if (-not $rowBUPC) { continue }
$RowA = $rowAByUPC[$rowBUPC]
# It seems only rows that have 'INP' in uom2 are important
# so continue if missing
if ($rowA.uom2 -ne 'INP') { continue }
if ($rowA.qty1in2 -ne $rowB.'Split Quantity') {
# Do what you came here to do.
}
}
Please note that as you have not shared the code within the matching condition, you may need to take the advice contained in this answer and apply it to the inner code.

Powershell create new array by subtracting from the original array

I have a large data array (Invoke-Sqlcmd export, array variable), lets say it contains all of our 'customers'.
I have large data list (Local search results, string variable), lets say it contains all of our 'paying customers'
I want to create a final array by subtracting the people in the list from the array, leaving me with 'customers who have not paid yet'. The code below does exactly this but runs very slowly; I'm hoping to educate myself on a quicker way.
The size of the first array is around 25,000 and the size of the list is around 20,000. Because the list is just a plain-text variable, I first split it on newlines so the ForEach can iterate (otherwise it registers as one object). The code I'm using to do this is:
$NewArray = $FirstArray
ForEach ($Customer In $List)
{
$NewArray = $NewArray | ? {$_.CustomerID -ne $Customer}
}
any help greatly appreciated, Thanks
I've done a bit of performance testing, and by far and away the fastest approach is @zett42's suggestion of converting the customer list into a hashtable.
Test Data
First, here's a little snippet to set up some test data. It has some unrealistic properties like being sorted by id, with the first 80% being paid customers and the last 20% unpaid, but I don't think that will affect the results much.
# build some test data
$count = 25000
$allCustomers = @(
1..$count | foreach-object {
[pscustomobject] [ordered] @{
"CustomerID" = $_
}
}
);
$paidIds = [int[]] @( 1..($count * 0.8) )
Original Approach
Here's the OP's approach as a baseline:
Measure-Command {
$NewArray = @( $allCustomers );
foreach( $paidId In $paidIds )
{
$NewArray = $NewArray | where-object { $_.CustomerID -ne $paidId }
}
}
# TotalMilliseconds : 37732.532
Hashtable
And here's @zett42's suggestion of using a hashtable to do fast removal of customers by CustomerId when we find them in the paid list:
Measure-Command {
# convert customers into a hashtable for fast removal by id
$lookups = @{};
foreach( $customer in $allCustomers )
{
$lookups.Add($customer.CustomerId, $customer)
}
# remove all the paid customer ids
foreach( $paidId In $paidIds )
{
$lookups.Remove($paidId)
}
# get the remaining unpaid customers
$NewArray = @( $lookups.Values )
}
# TotalMilliseconds : 53.9363
-notin
@AbrahamZinala's suggestion of -notin also works pretty well if you want a quick one-liner and don't mind not having the raw blazing speed of the hashtable approach.
Measure-Command {
$NewArray = $allCustomers | where-object { $_.CustomerId -notin $paidIds };
}
# TotalMilliseconds : 698.2812
The hashtable approach scales better with larger datasets, so if you don't mind the little bit of extra setup that might be the one to go for...
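One more variant that might be worth benchmarking alongside the above (my own sketch, not from the answers; the ::new() syntax needs PowerShell 5+): keep the paid IDs in a HashSet so each lookup is O(1), without mutating any collection.

```powershell
# Hypothetical five customers, of which IDs 1-3 have paid.
$allCustomers = 1..5 | ForEach-Object { [pscustomobject]@{ CustomerID = $_ } }
$paidSet = [System.Collections.Generic.HashSet[int]]::new([int[]](1, 2, 3))

# Constant-time membership test per customer, no list scanned, nothing removed.
$unpaid = $allCustomers | Where-Object { -not $paidSet.Contains($_.CustomerID) }
$unpaid.CustomerID   # 4 5
```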

PowerShell script to group records by overlapping start and end date

I am working on a CSV file which have start and end date and the requirement is group records by dates when the dates overlap each other.
For example, in below table Bill_Number 177835 Start_Date and End_Date is overlapping with 178682,179504, 178990 Start_Date and End_Date so all should be grouped together and so on for each and every record.
Bill_Number,Start_Date,End_Date
177835,4/14/20 3:00 AM,4/14/20 7:00 AM
178682,4/14/20 3:00 AM,4/14/20 7:00 AM
179504,4/14/20 3:29 AM,4/14/20 6:29 AM
178662,4/14/20 4:30 AM,4/14/20 5:30 AM
178990,4/14/20 6:00 AM,4/14/20 10:00 AM
178995,4/15/20 6:00 AM,4/15/20 10:00 AM
178998,4/15/20 6:00 AM,4/15/20 10:00 AM
I have tried different combination like "Group-by" and "for loop" but not able to produce result.
With the above example of CSV, the expected result is;
Group1: 177835,178682,179504, 178990
Group2: 177835,178682,179504, 178662
Group3: 178995, 178998
Currently i have below code in hand.
Any help on this will be appreciated,thanks in advance.
$array = @('ab','bc','cd','df')
for ($y = 0; $y -lt $array.count) {
for ($x = 0; $x -lt $array.count) {
if ($array[$y]-ne $array[$x]){
Write-Host $array[$y],$array[$x]
}
$x++
}
$y++
}
You can do something like the following. There is likely a cleaner solution, but that could take a lot of time.
$csv = Import-Csv file.csv
# Creates all inclusive groups where times overlap
$csvGroups = foreach ($row in $csv) {
$start = [datetime]$row.Start_Date
$end = [datetime]$row.End_Date
,($csv | where { ($start -ge [datetime]$_.Start_Date -and $start -le [datetime]$_.End_Date) -or ($end -ge [datetime]$_.Start_Date -and $end -le [datetime]$_.End_Date) })
}
# Removes duplicates from $csvGroups
$groups = $csvGroups | Group {$_.Bill_number -join ','} |
Foreach-Object { ,$_.Group[0] }
# Compares current group against all groups except itself
$output = for ($i = 0; $i -lt $groups.count; $i++) {
$unique = $true # indicates if the group's bill_numbers are in another group
$group = $groups[$i]
$list = $groups -as [system.collections.arraylist]
$list.RemoveAt($i) # Removes self
foreach ($innerGroup in $list) {
# If current group's bill_numbers are in another group, skip to next group
if ((compare $group.Bill_Number $innergroup.Bill_Number).SideIndicator -notcontains '<=') {
$unique = $false
break
}
}
if ($unique) {
,$group
}
}
$groupCounter = 1
# Output formatting
$output | Foreach-Object { "Group{0}:{1}" -f $groupCounter++,($_.Bill_Number -join ",")}
Explanation:
I added comments to give an idea as to what is going on.
The ,$variable syntax uses the unary operator ,. It wraps the output in an array. Typically, PowerShell unrolls an array into individual items. The unrolling becomes a problem here because we want the groups to stay as groups (arrays). Otherwise, there would be a lot of duplicate bill numbers, and we'd lose track between groups.
An arraylist is used for $list. This is so we can access the RemoveAt() method. A typical array is of fixed size and can't be manipulated in that fashion. This can effectively be done with an array, but the code is different. You either have to select the index ranges around the item you want to skip or create a new array using some other conditional statement that will exclude the target item. An arraylist is just easier for me (personal preference).
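The unary comma behavior is easy to demo in isolation (a tiny sketch of my own):

```powershell
# Without the comma, PowerShell unrolls each inner array into the output stream.
$unrolled = foreach ($i in 1..2) { @(1, 2) }
$wrapped  = foreach ($i in 1..2) { , @(1, 2) }
$unrolled.Count   # 4 -- flattened into individual numbers
$wrapped.Count    # 2 -- two arrays, preserved as arrays
```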
So, a very dirty approach. I think there are a couple of ways to determine whether there's overlap for a specific comparison, one record to another. However, you may need a list of bill numbers each bill's date range collides with. Using a function call in a Select-Object statement/expression, I added a Collisions property to your objects.
The function is wordy and could probably be improved, but the gist is that each record is compared to all other records, and a bill number is reported in its Collisions property if either the start or end date falls within the other record's range.
This is of course just demo code, I'm sure it can be made better for your purposes, but may be a starting point for you.
Obviously change the path to the CSV file.
Function Get-Collisions
{
Param(
[Parameter(Mandatory = $true)]
[Object]$ReferenceObject,
[Parameter( Mandatory = $true )]
[Object[]]$CompareObjects
) # End Parameter Block
ForEach($Object in $CompareObjects)
{
If( !($ReferenceObject.Bill_Number -eq $Object.Bill_Number) )
{
If(
( $ReferenceObject.Start_Date -ge $Object.Start_Date -and $ReferenceObject.Start_Date -le $Object.End_Date ) -or
( $ReferenceObject.End_Date -ge $Object.Start_Date -and $ReferenceObject.End_Date -le $Object.End_Date ) -or
( $ReferenceObject.Start_Date -le $Object.Start_Date -and $ReferenceObject.End_Date -ge $Object.Start_Date )
)
{
$Object.Bill_Number
}
}
}
} # End Get-Collisions
$Objects = Import-Csv 'C:\temp\DateOverlap.CSV'
$Objects |
ForEach-Object{
$_.Start_Date = [DateTime]$_.Start_Date
$_.End_Date = [DateTime]$_.End_Date
}
$Objects = $Objects |
Select-object *,#{Name = 'Collisions'; Expression = { Get-Collisions -ReferenceObject $_ -CompareObjects $Objects }}
$Objects | Format-Table -AutoSize
Let me know how it goes. Thanks.
@Shan, I saw your comments, so I wanted to respond with some additional code and discussion. I may have gone overboard, but you expressed a desire to learn, such that you can maintain these code pieces in the future. So, I put a lot of time into this.
I may mention some of @AdminOfThings' work too. That is not criticism, but collaboration. His example is clever and dynamic in terms of getting the job done and pulling in the right tools as he worked his way to the desired output.
I originally side-stepped the grouping question because I didn't feel like naming/numbering the groups had any meaning. For example: "Group 1" indicates all its members have overlap in their billing periods, but no indication of what or when the overlap is. Maybe I rushed through it… I may have been reading too much into it or perhaps even letting my own biases get in the way. At any rate, I elected to create a relationship from the perspective of each bill number, and that resulted in my first answer.
Since then, and because of your comment, I put effort into extending and documenting the first example I gave. The revised code will be Example 1 below. I've heavily commented it and most of the comments will apply to the original example as well. There are some differences that were forced by the extended grouping functionality, but the comments should reflect those situations.
Note: You'll also see I stopped calling them "collisions" and termed them "overlaps" instead.
Example 1:
Function Get-Overlaps
{
<#
.SYNOPSIS
Given an object (reference object) compare to a collection of other objects of the same
type. Return an array of billing numbers for which the billing period overlaps that of
the reference object.
.DESCRIPTION
Given an object (reference object) compare to a collection of other objects of the same
type. Return an array of billing numbers for which the billing period overlaps that of
the reference object.
.PARAMETER ReferenceObject
This is the current object you wish to compare to all other objects.
.PARAMETER CompareObjects
The collection of objects you want to compare with the reference object.
.NOTES
> The date time casting could probably have been done by further preparing
the objects in the calling code. However, given this is for a
StackOverflow question I can polish that later.
#>
Param(
[Parameter(Mandatory = $true)]
[Object]$ReferenceObject,
[Parameter( Mandatory = $true )]
[Object[]]$CompareObjects
) # End Parameter Block
[Collections.ArrayList]$Return = @()
$R_StartDate = [DateTime]$ReferenceObject.Start_Date
$R_EndDate = [DateTime]$ReferenceObject.End_Date
ForEach($Object in $CompareObjects)
{
$O_StartDate = [DateTime]$Object.Start_Date
$O_EndDate = [DateTime]$Object.End_Date
# The first if statement skips the reference object's bill_number
If( !($ReferenceObject.Bill_Number -eq $Object.Bill_Number) )
{
# This logic can use some explaining. So far as I could tell there were 2 cases to look for:
# 1) Either or both the start and end dates fell inside the timespan of the comparison
# object. This case is handled by the first 2 conditions.
# 2) If the reference objects timespan covers the entire timespan of the comparison object.
# Meaning the start date is before and the end date is after, fitting the entire
# comparison timespan is within the bounds of the reference timespan. I elected to use
# the 3rd condition below to detect that case because once the start date is earlier I
# only have to care if the end date is greater than the start date. It's a little more
# inclusive and partially covered by the previous conditions, but whatever, you gotta
# pick something...
#
# Note: This was a deceptively difficult thing to comprehend, I missed that last condition
# in my first example (later corrected) and I think @AdminOfThings also overlooked it.
If(
( $R_StartDate -ge $O_StartDate -and $R_StartDate -le $O_EndDate ) -or
( $R_EndDate -ge $O_StartDate -and $R_EndDate -le $O_EndDate ) -or
( $R_StartDate -le $O_StartDate -and $R_EndDate -ge $O_StartDate )
)
{
[Void]$Return.Add( $Object.Bill_Number )
}
}
}
Return $Return
} # End Get-Overlaps
$Objects =
Import-Csv 'C:\temp\DateOverlap.CSV' |
ForEach-Object{
# Consider overlap as a relationship from the perspective of a given Object.
$Overlaps = [Collections.ArrayList]@(Get-Overlaps -ReferenceObject $_ -CompareObjects $Objects)
# Knowing the overlaps I can infer the group, by adding the group's bill_number to its group property.
If( $Overlaps )
{ # Don't calculate a group unless you actually have overlaps:
$Group = $Overlaps.Clone()
[Void]$Group.Add( $_.Bill_Number ) # Could be done in the line above, but for readability I separated it.
}
Else { $Group = $null } # Ensures we're not reusing a group from a previous iteration of the loop.
# Create a new PSCustomObject with the data so far.
[PSCustomObject][Ordered]@{
Bill_Number = $_.Bill_Number
Start_Date = [DateTime]$_.Start_Date
End_Date = [DateTime]$_.End_Date
Overlaps = $Overlaps
Group = $Group | Sort-Object # Sorting will make it a lot easier to get unique lists later.
}
}
# The reason I recreated the objects from the CSV file instead of using Select-Object as I had
# previously is that I simply couldn't get Select-Object to maintain the ArrayList type being
# returned from the function. I know that's a documented problem or circumstance somewhere.
# Now I'll add one more property called Group_ID a comma delimited string that we can later use
# to echo the groups according to your original request.
$Objects =
$Objects |
Select-Object *,@{Name = 'Group_ID'; Expression = { $_.Group -join ', ' } }
# This output is just for the sake of showing the new objects:
$Objects | Format-Table -AutoSize -Wrap
# Now create an array of unique Group_ID strings; this is possible because of the sorts and joins done earlier.
$UniqueGroups = $Objects.Group_ID | Select-Object -Unique
$Num = 1
ForEach($UniqueGroup in $UniqueGroups)
{
"Group $Num : $UniqueGroup"
++$Num # Increment $Num, using the convenient unary operator, so the next group is echoed properly.
}
# Below is a traditional for loop that does the same thing. I did that first before deciding the ForEach
# was cleaner. Leaving it commented below, because you're on a learning-quest, so just more demo code...
# For($i = 0; $i -lt $UniqueGroups.Count; ++$i)
# {
# $Num = $i + 1
# $UniqueGroup = $UniqueGroups[$i]
# "Group $Num : $UniqueGroup"
# }
Example 2:
$Objects =
Import-Csv 'C:\temp\DateOverlap.CSV' |
Select-Object Bill_Number,
@{ Name = 'Start_Date'; Expression = { [DateTime]$_.Start_Date } },
@{ Name = 'End_Date'; Expression = { [DateTime]$_.End_Date } }
# The above select statement converts the Start_Date & End_Date properties to [DateTime] objects
# While you had asked to pack everything into the nested loops, that would have resulted in
# unnecessary recasting of object types to ensure proper comparison. Often this is a matter of
# preference, but in this case I think it's better. I did have it working well without the
# above select, but the code is more readable / concise with it. So even if you treat the
# Select-Object command as a blackbox the rest of the code should be easier to understand.
#
# Of course, if you couldn't tell from my samples, Select-Object is incredibly useful. I
# recommend taking the time to learn it thoroughly. The MS documentation can be found here:
# https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/select-object?view=powershell-5.1
:Outer ForEach( $ReferenceObject in $Objects )
{
# In other revisions I had assigned these values to some shorter variable names.
# I took that out. Again, since you're learning, I wanted all the dot referencing
# to be on full display.
[Collections.ArrayList]$TempArrList = @() # Reset this on each iteration of the outer loop.
:Inner ForEach( $ComparisonObject in $Objects )
{
If( $ComparisonObject.Bill_Number -eq $ReferenceObject.Bill_Number )
{ # Skip the current reference object in the $Objects collection! This prevents duplicating
# the current Bill's number within its own group, helping to keep each group's membership unique.
#
# By now you should have seen, across all revisions including AdminOfThings' demo, that there was
# some need to skip the current item when searching for overlaps, and that there are a number of
# ways to accomplish that. In this case I simply go back to the top of the loop when the current
# record is encountered, effectively skipping it.
Continue Inner
}
# The below logic needs some explaining. So far as I could tell there were 2 cases to look for:
# 1) Either or both of the reference object's start and end dates fall inside the timespan of
#    the comparison object. This case is handled by the first 2 conditions.
# 2) The reference object's timespan covers the entire timespan of the comparison object,
#    meaning the start date is before and the end date is after, so the whole comparison
#    timespan falls within the bounds of the reference timespan. I elected to use the
#    3rd condition below to detect that case because once the start date is earlier I
#    only have to care if the end date is greater than the other start date. It's a little
#    more inclusive and partially covered by the previous conditions, but whatever, you gotta
#    pick something...
#
# Note: This was a deceptively difficult thing to comprehend; I missed that last condition
# in my first example (later corrected) and I think @AdminOfThings also overlooked it.
If(
( $ReferenceObject.Start_Date -ge $ComparisonObject.Start_Date -and $ReferenceObject.Start_Date -le $ComparisonObject.End_Date ) -or
( $ReferenceObject.End_Date -ge $ComparisonObject.Start_Date -and $ReferenceObject.End_Date -le $ComparisonObject.End_Date ) -or
( $ReferenceObject.Start_Date -le $ComparisonObject.Start_Date -and $ReferenceObject.End_Date -ge $ComparisonObject.Start_Date )
)
{
[Void]$TempArrList.Add( $ComparisonObject.Bill_Number )
}
}
# Now Add the properties!
$ReferenceObject | Add-Member -Name Overlaps -MemberType NoteProperty -Value $TempArrList
If( $ReferenceObject.Overlaps )
{
[Void]$TempArrList.Add($ReferenceObject.Bill_Number)
$ReferenceObject | Add-Member -Name Group -MemberType NoteProperty -Value ( $TempArrList | Sort-Object )
$ReferenceObject | Add-Member -Name Group_ID -MemberType NoteProperty -Value ( $ReferenceObject.Group -join ', ' )
# Below a script property also works, but I think the above is easier to follow:
# $ReferenceObject | Add-Member -Name Group_ID -MemberType ScriptProperty -Value { $this.Group -join ', ' }
}
Else
{
$ReferenceObject | Add-Member -Name Group -MemberType NoteProperty -Value $null
$ReferenceObject | Add-Member -Name Group_ID -MemberType NoteProperty -Value $null
}
}
# This output is just for the sake of showing the new objects:
$Objects | Format-Table -AutoSize -Wrap
# Now create an array of unique Group_ID strings; this is possible because of the sorts and joins done earlier.
#
# It's important to point out I chose to sort because I saw the clever solution that AdminOfThings
# used. There's a need to display only groups that have unique memberships, not necessarily unique
# orderings of the members. He identified these by doing some additional loops and using the
# Compare-Object cmdlet. Again, I must say that was very clever, and Compare-Object is another tool
# very much worth getting to know. However, the code didn't seem to care which of the various
# orderings it ultimately output. Therefore I concluded the order wasn't really important, and it's
# fine if the groups are sorted. With the objects sorted it's much easier to derive the truly
# unique lists with the simple Select-Object command below.
$UniqueGroups = $Objects.Group_ID | Select-Object -Unique
# Finally Loop through the UniqueGroups
$Num = 1
ForEach($UniqueGroup in $UniqueGroups)
{
"Group $Num : $UniqueGroup"
++$Num # Increment $Num, using the convenient unary operator, so the next group is echoed properly.
}
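To make the three overlap conditions in Example 2 concrete, here's a minimal sketch using two made-up date ranges (the bill data is invented for illustration; only the comparison logic comes from the code above):

```powershell
# Made-up reference and comparison ranges; the reference ends inside the comparison span.
$Ref = [PSCustomObject]@{ Start_Date = [DateTime]'2020-01-10'; End_Date = [DateTime]'2020-01-20' }
$Cmp = [PSCustomObject]@{ Start_Date = [DateTime]'2020-01-15'; End_Date = [DateTime]'2020-01-25' }

# Same three conditions as the If statement in Example 2:
$Overlaps =
    ( $Ref.Start_Date -ge $Cmp.Start_Date -and $Ref.Start_Date -le $Cmp.End_Date ) -or
    ( $Ref.End_Date   -ge $Cmp.Start_Date -and $Ref.End_Date   -le $Cmp.End_Date ) -or
    ( $Ref.Start_Date -le $Cmp.Start_Date -and $Ref.End_Date   -ge $Cmp.Start_Date )

$Overlaps # True here: the second condition fires because End_Date falls inside the comparison span.
```

Swapping the dates around lets you watch each of the three conditions fire in turn, which is a good way to convince yourself the third condition really is needed.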
Additional Discussion:
Hopefully the examples are helpful. I wanted to mention a few more points:
Using ArrayLists ( [System.Collections.ArrayList] ) instead of native arrays. The main reasons to do this are speed and the flexibility to easily add and remove elements: appending to a native array with += copies the entire array on every add, while ArrayList's Add method does not. If you search the internet you'll find hundreds of articles explaining why it's faster. It's so common you'll often find experienced PowerShell users implementing it instinctively.
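As a quick illustration (timings will vary by machine, so treat the numbers as relative), you can compare the two approaches with Measure-Command:

```powershell
# Native array: += allocates a new array and copies every element each time.
Measure-Command {
    $arr = @()
    1..10000 | ForEach-Object { $arr += $_ }
} | Select-Object TotalMilliseconds

# ArrayList: Add appends in place, so it stays fast as the collection grows.
Measure-Command {
    $list = [System.Collections.ArrayList]@()
    1..10000 | ForEach-Object { [Void]$list.Add($_) }
} | Select-Object TotalMilliseconds
```

The gap widens as the element count grows, which is exactly why the += hashtable pattern in the original question performed so poorly.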
You'll notice I relied heavily on the ability to append new properties to objects. There are several ways to do this: Select-Object, creating your own objects, and in Example 2 above I used Add-Member. The main reason I used Add-Member was I couldn't get the ArrayList type to stick when using Select-Object.
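For reference, a tiny sketch of both approaches (the object and property names here are just illustrative):

```powershell
$obj = [PSCustomObject]@{ Name = 'Bill_001' }

# Select-Object builds a NEW object carrying the extra calculated property:
$withSelect = $obj | Select-Object *, @{ Name = 'Tag'; Expression = { 'demo' } }

# Add-Member attaches a property to the EXISTING object in place:
$obj | Add-Member -Name Group -MemberType NoteProperty -Value ([Collections.ArrayList]@(1, 2))
$obj.Group.GetType().Name # The value keeps its ArrayList type this way.
```

The in-place behavior of Add-Member is what let the ArrayList type survive in Example 2.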
Regarding loops. This is specific to your desire for nested loops. My first answer still had loops, except some were implied by the pipe, and others were stored in a helper function. The latter is really also a preference; for readability it's sometimes helpful to park some code out of view from the main code body. That said, all the same concepts were there from the beginning. You should get comfortable with the implied loop that comes with the pipelining capability.
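For example, these two forms iterate identically; the pipeline version just keeps the loop implicit (the sample objects are made up):

```powershell
$Objects = 1..3 | ForEach-Object { [PSCustomObject]@{ Bill_Number = $_ } }

# Explicit loop:
ForEach( $Object in $Objects ) { $Object.Bill_Number }

# The same iteration, implied by the pipeline:
$Objects | ForEach-Object { $_.Bill_Number }
```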
I don't think there's much more I can say without getting redundant. I really hope this was helpful, it was certainly fun for me to work on it. If you have questions or feedback let me know. Thanks.